The Problem

x86 and x86_64 have a project called User-Mode Linux that allows the Linux kernel to run in userspace: essentially, using Linux as a hypervisor for a paravirtualised Linux.

I set out to do the same for IA64, but starting from the vNUMA virtual machine, and using AfterBurning (automatic pre-virtualisation).

Getting the code

You can get the latest and greatest snapshot (which doesn't always work) from the CVSRepository, or get the latest releases from the ERTOS website's Linux On Linux page.

How it Works

Each virtual processor is represented by a single process on the host. The virtual machine monitor (VMM, or hypervisor) lives in low memory; when it is started, it sets up signal handlers for SIGSEGV, SIGILL and other relevant signals. When the guest kernel performs a privileged operation, the host kernel generates a SIGILL (illegal instruction signal) that is delivered to the hypervisor, which can then emulate the operation in terms of the virtual processor state.
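A minimal sketch of that trap-and-emulate path is below. All names here (struct vcpu, emulate_privileged_op, install_trap_handlers) are hypothetical and only illustrate the structure; they are not the actual Linux-on-Linux sources.

#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <ucontext.h>

struct vcpu { uint64_t psr; uint64_t rr[8]; };   /* illustrative subset of virtual CPU state */
static struct vcpu vcpu;

/* Hypothetical helper: decode the IA-64 bundle at 'ip', emulate the
 * privileged operation by updating 'v' and the saved register context
 * in 'uc', then advance the guest instruction pointer. */
static void emulate_privileged_op(struct vcpu *v, ucontext_t *uc, uint64_t ip)
{
    (void)v; (void)uc; (void)ip;   /* decoding and emulation elided */
}

static void sigill_handler(int sig, siginfo_t *si, void *uc_)
{
    (void)sig;
    ucontext_t *uc = uc_;
    emulate_privileged_op(&vcpu, uc, (uint64_t)si->si_addr);
}

void install_trap_handlers(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = sigill_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGILL, &sa, NULL);
    /* SIGSEGV and the other relevant signals are hooked the same way. */
}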

Signal handling is relatively slow, and once the time to disassemble the instruction is added, it becomes prohibitively slow. Therefore we use afterburning: a technique whereby sensitive instructions are recognised and replaced at assembly time.
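The C fragment below sketches only the idea: the name lol_get_psr is hypothetical, and the real rewriting happens on the guest's assembly output, not in C.

#include <stdint.h>

/* Illustrative virtual CPU state; in the real VMM this lives at a fixed
 * low address (see the memory map below). */
static struct { uint64_t psr; } vcpu;

/* Hypothetical stub: the afterburner rewrites the guest kernel's
 * 'mov rX = psr' into a branch that ends up here, so the virtual PSR is
 * read straight from memory with no trap and no signal delivery. */
uint64_t lol_get_psr(void)
{
    return vcpu.psr;
}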

Memory Model

The guest runs in userspace, so its memory map needs to be rearranged to fit into what is left over after the host kernel steals regions 5, 6 and 7.

Region | Address range                         | Host use                                      | Guest use
7      | 0xFFFFFFFFFFFF0000-0xFFFFFFFFFFFFFFFF | Per-CPU page                                  | Unavailable
7      | 0xE000000000000000-0xFFFFFFFFFFFF0000 | Identity mapped, cached                       | Unavailable
6      | 0xC000000000000000-0xDFFFFFFFFFFFFFFF | Identity mapped, uncached                     | Unavailable
5      | 0xA000000100000000-...                | Kernel text and data                          | Unavailable
5      | 0xA000000000000000-0xA00000000001FFFF | Gate pages                                    | Used for syscalls to host
4      | 0x8000000100000000-...                | HugeTLBFS (if configured)                     | Kernel text and data
4      | 0x8000000000000000-0x800000000001FFFF | HugeTLBFS (if configured)                     | Gate pages
3      | 0x6000000000000000-0x7FFFFFFFFFFFFFFF | User mappings (e.g., stack, shared libraries) | User mappings (e.g., stack, shared libraries)
2      | 0x4000000000000000-0x5FFFFFFFFFFFFFFF | User mappings (e.g., shared libraries, data)  | User mappings (e.g., shared libraries, data)
1      | 0x2000000000000000-0x3FFFFFFFFFFFFFFF | User mappings (e.g., program text)            | User mappings (e.g., program text)
0      | 0x0000000200000000-...                | IA32 compat                                   | VMM text+data+stack
0      | 0x0000000000080000-0x00000001FFFFFFFF | IA32 compat                                   | Physical memory
0      | 0x0000000000050000-0x000000000007FFFF | IA32 compat                                   | Unmapped
0      | 0x0000000000040000-0x000000000004FFFF | IA32 compat                                   | Virtual CPU
0      | 0x0000000000020000-0x000000000003FFFF | IA32 compat                                   | Unmapped
0      | 0x0000000000010000-0x000000000001FFFF | IA32 compat                                   | Per-CPU page
0      | 0x0000000000000000-0x000000000000FFFF | Unmapped                                      | Unmapped

The physical memory is mapped from a file in /tmp. The VMM maintains a cache of virtual TLB entries (its page table). When there is a page fault (signalled by SIGSEGV), it looks up the virtual address in its page tables and maps the appropriate part of the physical memory at the virtual address that faulted.
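A sketch of that SIGSEGV path follows; phys_mem_fd, vtlb_lookup, deliver_tlb_miss_to_guest and the 16KB page size are all assumptions for illustration, not the real interfaces.

#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

#define GUEST_PAGE_SIZE 0x4000UL   /* 16KB guest pages, for example */

extern int phys_mem_fd;            /* the file in /tmp backing guest RAM */

/* Hypothetical: look 'vaddr' up in the cached virtual TLB; on a hit,
 * fill in the offset into the physical-memory file and the protection
 * bits and return 0, otherwise return nonzero. */
int vtlb_lookup(uint64_t vaddr, off_t *offset, int *prot);

/* Hypothetical: reflect the fault back into the guest kernel as an
 * IA-64 TLB-miss fault for it to handle. */
void deliver_tlb_miss_to_guest(uint64_t vaddr, void *uc);

void sigsegv_handler(int sig, siginfo_t *si, void *uc)
{
    (void)sig;
    uint64_t vaddr = (uint64_t)si->si_addr & ~(GUEST_PAGE_SIZE - 1);
    off_t offset;
    int prot;

    if (vtlb_lookup(vaddr, &offset, &prot) == 0) {
        /* Hit: map the right part of the physical-memory file at the
         * faulting virtual address, then retry the access. */
        mmap((void *)vaddr, GUEST_PAGE_SIZE, prot,
             MAP_FIXED | MAP_SHARED, phys_mem_fd, offset);
    } else {
        /* Miss: hand the fault to the guest as a TLB miss. */
        deliver_tlb_miss_to_guest(vaddr, uc);
    }
}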

When the guest kernel writes its region registers, all the mappings for that region should change. We tried a number of different optimisations to speed this process up (a sketch of the page-table cache appears after the list).

  1. Remove all mappings and clear the page tables for a region whenever its region register changes. Allow page faults to re-enter the guest as TLB misses for it to deal with. This is the slowest approach; the VMM is emulating a TLB without address space IDs.

  2. Keep page tables around as a cache for a number of different region register address space IDs. When the region register changes, unmap the region and switch which page table is being used. When page faults (SIGSEGVs) occur, map from the current page table. Thus the VMM is emulating a TLB with address space IDs.

  3. As for 2, but eagerly remap the region as soon as the region register is changed. By avoiding signal overhead, this is a clear win over the first two approaches.

  4. As for 3, but also maintain in the host kernel a small cache of address spaces. When region register 1 is changed, swap the entire address space with all its mappings. This optimisation gave the best performance.
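The sketch below illustrates optimisations 2 and 3; the names (shadow_pt, unmap_region, remap_region_from, evict_and_reset) and the cache size are hypothetical, and the host-kernel address space cache of optimisation 4 is not shown.

#include <stdint.h>

#define PT_CACHE_SIZE 8

/* Hypothetical shadow page table keyed by the region ID (the address
 * space ID held in a region register). */
struct shadow_pt { uint64_t rid; /* ...cached translations... */ };

static struct shadow_pt pt_cache[PT_CACHE_SIZE];
static struct shadow_pt *current_pt[8];                      /* one per virtual region */

void unmap_region(int region);                               /* hypothetical */
void remap_region_from(int region, struct shadow_pt *pt);    /* hypothetical */
struct shadow_pt *evict_and_reset(uint64_t rid);             /* hypothetical */

void guest_set_region_register(int region, uint64_t rid)
{
    /* The old address space ID no longer applies, so drop the host
     * mappings for this region (optimisation 1 would also wipe the
     * page table here). */
    unmap_region(region);

    /* Optimisation 2: reuse a cached page table for this RID. */
    for (int i = 0; i < PT_CACHE_SIZE; i++) {
        if (pt_cache[i].rid == rid) {
            current_pt[region] = &pt_cache[i];
            /* Optimisation 3: eagerly re-establish its mappings now
             * rather than taking one SIGSEGV per page later. */
            remap_region_from(region, &pt_cache[i]);
            return;
        }
    }

    /* No cached page table: evict an entry and start empty; later
     * page faults will fill it in, as in optimisation 2. */
    current_pt[region] = evict_and_reset(rid);
}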

Networking

The virtual machine expects the guest to use the HP simulated Ethernet driver. Packets sent and received on this are redirected to the host's tapN interface. The simplest way to get your virtual machine onto the network is to configure the host's primary interface as an Ethernet bridge. On Debian, put this into /etc/network/interfaces:

auto br0
iface br0 inet dhcp
        bridge_ports eth0

(assuming eth0 is your primary Ethernet interface). Then you can attach the appropriate tapN to the same bridge with:

brctl addif br0 tap1

The runvmm script does this for you automatically. If you don't have spare addresses on your network (for instance, if it's controlled by the BOFH), then you can configure your host as a NATting firewall; this allows the guest to see out, but does not allow inward connexions. Again, runvmm can do this for you if you wish.

Disk I/O

The guest uses the HP simulated SCSI device to talk to the VMM. The VMM opens and closes files, and reads and writes them, in response to SCSI commands. This is currently not particularly secure; a malicious guest kernel could open any file on the host system.
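A rough sketch of how such requests might be turned into host file I/O is below; the request structure and its encoding are invented for illustration and do not reproduce HP's simulated SCSI interface. It also makes the security caveat concrete: the path comes straight from the guest and is not checked.

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

struct sim_scsi_req {
    int      op;        /* hypothetical: 0 = open, 1 = read, 2 = write */
    char     path[256]; /* supplied by the guest, unchecked on the host */
    int      fd;
    uint64_t offset;
    void    *buf;
    uint64_t len;
};

long handle_sim_scsi(struct sim_scsi_req *r)
{
    switch (r->op) {
    case 0:  return open(r->path, O_RDWR);     /* any host file the VMM can reach */
    case 1:  return pread(r->fd, r->buf, r->len, r->offset);
    case 2:  return pwrite(r->fd, r->buf, r->len, r->offset);
    default: return -1;
    }
}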

Next Release

Features

Future Work

Change Log

Known Problems
