The Problem
x86 and x86_64 have a project called User-Mode Linux that allows the Linux kernel to run in userspace; essentially, it uses Linux as a hypervisor for a paravirtualised Linux.
I set out to do the same for IA64, but starting from the vNUMA virtual machine, and using AfterBurning (automatic previrtualisation).
Getting the code
You can get the latest and greatest snapshot (which doesn't always work) from the CVSRepository, or get the latest releases from the Linux On Linux page on the ERTOS website.
How it Works
Each virtual processor is represented by a single process on the host. The virtual machine monitor (VMM, or hypervisor) lives in low memory; when it is started it sets up signal handlers for SIGSEGV, SIGILL and other relevant signals. When the guest kernel performs a privileged operation, the host kernel generates a SIGILL (illegal operation signal) that is delivered to the hypervisor, which can then emulate the operation in terms of virtual processor state.
Signal handling is relatively slow, and once the time to disassemble the instruction is added, it becomes prohibitively slow. Therefore we use afterburning, a technique whereby instructions can be recognised and replaced at assembly time.
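The handler setup described above can be sketched in C as follows. This is only an illustration: `struct vcpu`, `vcpu0`, `vmm_trap` and `vmm_install_handlers` are hypothetical names, not taken from the VMM source, and the handler merely records the trap where a real monitor would decode and emulate the faulting instruction.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical virtual CPU state; the real VMM's layout is not described here. */
struct vcpu {
    volatile sig_atomic_t traps;      /* traps taken so far */
    volatile sig_atomic_t last_signo; /* signal that caused the latest trap */
};

struct vcpu vcpu0;

/*
 * Signal handler standing in for the emulation path.  A real monitor would
 * inspect the saved context (uctx), decode the privileged instruction,
 * update the virtual processor state and step past the instruction.
 * This sketch only records that a trap happened.
 */
void vmm_trap(int signo, siginfo_t *info, void *uctx)
{
    (void)info;
    (void)uctx;
    vcpu0.traps++;
    vcpu0.last_signo = signo;
}

/* Install the same handler for SIGILL and SIGSEGV.  Returns 0 on success. */
int vmm_install_handlers(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = vmm_trap;
    sa.sa_flags = SA_SIGINFO;   /* ask for siginfo_t and the saved context */
    sigemptyset(&sa.sa_mask);

    if (sigaction(SIGILL, &sa, NULL) != 0)
        return -1;
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    assert(vcpu0.traps == 0);   /* no trap can have been counted yet */
    return 0;
}
```

After calling `vmm_install_handlers()`, a `raise(SIGILL)` exercises the path without needing a genuinely privileged instruction.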
Memory Model
The guest runs in userspace, so its memory map needs to be rearranged to fit into what's left over after the host kernel steals regions 5, 6 and 7.
Region Number | Address range | Host Use | Guest use |
7 | 0xFFFFFFFFFFFF0000-0xFFFFFFFFFFFFFFFF | Per CPU page | Unavailable |
7 | 0xE000000000000000-0xFFFFFFFFFFFF0000 | Identity mapped, cached | Unavailable |
6 | 0xC000000000000000-0xDFFFFFFFFFFFFFFF | Identity mapped, uncached | Unavailable |
5 | 0xA000000100000000-... | Kernel text and data | Unavailable |
5 | 0xA000000000000000-0xA00000000001FFFF | Gate pages | Used for syscalls to host |
4 | 0x8000000100000000-... | HugeTLBFS (if configured) | Kernel text and data |
4 | 0x8000000000000000-0x800000000001FFFF | HugeTLBFS (if configured) | Gate pages |
3 | 0x6000000000000000-0x7FFFFFFFFFFFFFFF | User mappings (e.g., stack, shared libraries) | User mappings (e.g., stack, shared libraries) |
2 | 0x4000000000000000-0x5FFFFFFFFFFFFFFF | User mappings (e.g., shared libraries, data) | User mappings (e.g., shared libraries, data) |
1 | 0x2000000000000000-0x3FFFFFFFFFFFFFFF | User mappings (e.g., program text) | User mappings (e.g., program text) |
0 | 0x0000000200000000-... | IA32 compat | VMM text+data+stack |
0 | 0x0000000000080000-0x00000001FFFFFFFF | IA32 compat | Physical memory |
0 | 0x0000000000050000-0x000000000007FFFF | IA32 compat | Unmapped |
0 | 0x0000000000040000-0x000000000004FFFF | IA32 compat | Virtual CPU |
0 | 0x0000000000020000-0x000000000003FFFF | IA32 compat | Unmapped |
0 | 0x0000000000010000-0x000000000001FFFF | IA32 compat | Per-cpu page |
0 | 0x0000000000000000-0x000000000000FFFF | Unmapped | Unmapped |
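The region numbers in the table come from the top three bits (63:61) of a 64-bit IA64 virtual address. As a small illustration, the helpers below recover the region from an address and check whether it lies in the regions the host keeps for itself. `region_of` and `host_reserved` are names made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* IA64 selects one of eight regions using bits 63:61 of the virtual address. */
unsigned region_of(uint64_t vaddr)
{
    unsigned region = (unsigned)(vaddr >> 61);

    assert(region < 8);   /* three bits can only encode 0..7 */
    return region;
}

/*
 * Per the table, the host kernel occupies regions 5, 6 and 7, so the guest
 * has to fit its own kernel and user mappings into regions 0 to 4.
 * Returns 1 for an address in a host-owned region, 0 otherwise.
 */
int host_reserved(uint64_t vaddr)
{
    return region_of(vaddr) >= 5;
}
```

For example, the guest kernel text at 0x8000000100000000 falls in region 4, while the host kernel text at 0xA000000100000000 falls in region 5.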
The guest's physical memory is mapped from a file in /tmp. The VMM maintains a cache of virtual TLB entries (its page table). When there's a page fault (signalled by SIGSEGV) it looks up the virtual address in its page tables and maps the appropriate part of the physical memory to the virtual address that faulted.
When the guest kernel writes its region registers, all the mappings for that region should change. We tried a number of different optimisations to speed up this process.
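The fault-driven mapping described above can be sketched as follows, assuming a Linux host. All names (`vtlb`, `guest_base`, `on_segv`, `vmm_mem_init`), the 16-page window and the temporary file name are invented for the sketch. Calling `mmap` from a signal handler works on Linux but is not guaranteed by POSIX, and where this sketch exits on a missing entry a real monitor would pass the miss to the guest.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

enum { NPAGES = 16 };           /* size of the toy guest, in pages */

int   phys_fd = -1;             /* file in /tmp backing guest physical memory */
char *guest_base;               /* start of the guest's virtual window */
long  page_size;
long  vtlb[NPAGES];             /* virtual page -> physical page, -1 = no entry */
volatile sig_atomic_t faults;   /* faults resolved by the handler */

/* SIGSEGV handler: map the faulting page from the backing file. */
void on_segv(int signo, siginfo_t *info, void *uctx)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)guest_base;
    long vpn;

    (void)signo;
    (void)uctx;
    if (addr < base || addr >= base + (uintptr_t)(NPAGES * page_size))
        _exit(2);               /* fault outside the guest window */
    vpn = (long)((addr - base) / (uintptr_t)page_size);
    if (vtlb[vpn] < 0)
        _exit(3);               /* no entry: a real VMM would involve the guest */
    if (mmap(guest_base + vpn * page_size, (size_t)page_size,
             PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
             phys_fd, (off_t)(vtlb[vpn] * page_size)) == MAP_FAILED)
        _exit(4);
    faults++;                   /* returning restarts the faulting access */
}

/* Create the backing file, reserve the guest window, install the handler. */
int vmm_mem_init(void)
{
    char path[] = "/tmp/guest-physmem-XXXXXX";
    struct sigaction sa;
    int i;

    page_size = sysconf(_SC_PAGESIZE);
    phys_fd = mkstemp(path);
    if (phys_fd < 0)
        return -1;
    unlink(path);               /* file disappears once the descriptor closes */
    if (ftruncate(phys_fd, (off_t)(NPAGES * page_size)) != 0)
        return -1;
    guest_base = mmap(NULL, (size_t)(NPAGES * page_size), PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (guest_base == MAP_FAILED)
        return -1;
    for (i = 0; i < NPAGES; i++)
        vtlb[i] = -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Once `vtlb[0]` is set to a physical page number, the first access to `guest_base[0]` faults, the handler maps the page, and the access is retried transparently. Later accesses to the same page run at full speed.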
1. Remove all mappings and clear page tables for a region whenever the region register changes. Allow page faults to re-enter the guest as TLB misses for it to deal with. This is the slowest. The VMM is emulating a TLB without address space IDs.
2. Keep page tables around as a cache for a number of different region register address space IDs. When the region register changes, unmap the region and switch which page table is being used. When page faults occur (SIGSEGVs), map from the current page table. Thus the VMM is emulating a TLB with address space IDs.
3. As for 2, but eagerly remap the region as soon as the region register is changed. By avoiding signal overhead, this is a clear winner.
4. As for 3, but also maintain in the host kernel a small cache of address spaces. When region register 1 is changed, swap the entire address space with all its mappings. This optimisation gave the best performance.
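The page-table cache behind options 2 and 3 can be sketched as a small table keyed by region ID. The slot count, the least-recently-used eviction policy and every name below are assumptions made for illustration; the text does not say how the real VMM sizes or evicts its cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { PT_SLOTS = 4, PT_ENTRIES = 8 };

struct page_table {
    uint32_t rid;               /* region ID (address space ID) cached here */
    int      valid;             /* slot currently holds a table */
    unsigned last_use;          /* timestamp for LRU eviction */
    long     entry[PT_ENTRIES]; /* virtual page -> physical page, -1 = none */
};

struct pt_cache {
    struct page_table slot[PT_SLOTS];
    unsigned clock;
};

void pt_cache_init(struct pt_cache *c)
{
    memset(c, 0, sizeof *c);
}

/*
 * Called when the guest writes a region register with a new region ID.
 * On a hit the old table is reused, so the VMM could remap every entry
 * eagerly (option 3) instead of waiting for faults (option 2).
 * On a miss the least recently used slot is recycled and starts empty.
 */
struct page_table *pt_switch(struct pt_cache *c, uint32_t rid, int *hit)
{
    struct page_table *victim = NULL;
    int i;

    assert(c != NULL && hit != NULL);
    for (i = 0; i < PT_SLOTS; i++) {
        struct page_table *pt = &c->slot[i];
        if (pt->valid && pt->rid == rid) {
            pt->last_use = ++c->clock;
            *hit = 1;
            return pt;
        }
    }
    for (i = 0; i < PT_SLOTS; i++) {
        struct page_table *pt = &c->slot[i];
        if (!pt->valid) {       /* a free slot always wins */
            victim = pt;
            break;
        }
        if (victim == NULL || pt->last_use < victim->last_use)
            victim = pt;
    }
    victim->rid = rid;
    victim->valid = 1;
    victim->last_use = ++c->clock;
    for (i = 0; i < PT_ENTRIES; i++)
        victim->entry[i] = -1;
    *hit = 0;
    return victim;
}
```

Option 1 corresponds to always taking the miss path; option 4 moves a similar cache into the host kernel, which is beyond what a userspace sketch can show.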
Networking
The virtual machine expects the guest to use the HP Simulated Ethernet driver. Packets sent or received on this are redirected to the tapN interface. The simplest way to get your virtual machine onto the network is to configure the host's primary interface as an Ethernet bridge. On Debian, put this into /etc/network/interfaces:
auto br0
iface br0 inet dhcp
bridge_ports eth0
(assuming eth0 is your primary Ethernet interface). Then you can attach the appropriate tapN to the same bridge with
brctl addif br0 tap1
The runvmm script does this for you automatically. If you don't have spare addresses on your network (for instance, if it's controlled by the BOFH), then you can configure your host as a NATting firewall; this allows the guest to see out, but does not allow inward connexions. Again, runvmm can do this for you if you wish.
Disk I/O
The guest uses the HP Simulated SCSI device to talk to the VMM. The VMM opens/closes files and reads/writes them in response to SCSI commands. This is currently not particularly secure; a malicious kernel could open any file on the host system.
Next Release
Features
- Massively faster: now only around 12% slower than native on a kernel compilation benchmark.
- No longer any need for external bootloader.
- Much more robust.
- Easier to install.
Future Work
- Combine with UserLevelDrivers work to allow direct access to (virtualised) PCI bus from the guest.
- Security improvements: drop privilege after setup in VMM, so that the user who invokes the guest cannot access files that don't belong to him/her.
- Performance measurement and analysis. How can the virtualisation overhead be reduced even more?
- vNUMA reinstatement. The codebase derived from an early version of vNUMA; can the network DSM and NUMA features from vNUMA be reintegrated to provide a userspace NUMA emulation?
- GDB friendliness. The VMM and guest use the break instruction extensively, but gdb wants to use SIGTRAP exclusively. Can we rearrange things to use a different trap method to context switch to the VMM?
Change Log
- Multiple address spaces in one process speed up context switch time by an order of magnitude, reducing overall virtualisation overhead to around 10-12% on UP.
- If the host kernel doesn't support multi-as, the fallback code now works properly.
- Guest gets the right value of processor speed.
- runvmm script updated to cope with no need for bootloader.
- runvmm script removes temp files unless asked to leave them.
- ertos.nicta.com.au virtualisation pages updated
- Man page and README updated.
Known Problems
- If the guest OOMs (out-of-memory) it sometimes takes the hypervisor with it.
- Sublinear scaling on SMP/NUMA machines
