This page is currently out of date - fix up after LCA

The page tables in Linux are implemented as an open coded multi level page table (MLPT). The purpose of the page table interface (PTI) project is to define an implementation independent interface to operate on the page tables. This will enable experimentation with different page table structures more suited to the 64 bit address space. The MLPT will remain the DEFAULT page table in Linux and will be referred to as the default page table.

The page table interface has been developed by abstracting the current page table implementation (the default MLPT) behind an independent interface. The interface is explained below from the perspective of the default MLPT.

Patches moving the MLPT behind the new interface can be found at the Gelato CVS repository for the kernel at: (Link up soon)

For further information please contact PaulDavies at Gelato@UNSW.

1) Unpack the kernel.
2) Apply the pti patch series to add the architecture independent interface.
3) Apply the patch pti-ia64-2.6.17-rc4.patch to push the IA64 specific part of the page table interface.
4) Compile for an IA64 and run the kernel on an IA64.

The patched kernel will have the same functionality as the unpatched kernel with almost no deterioration in performance (refer to benchmarks). The kernel is then ready for different page table implementations to be plugged into its new page table interface.

A patch to choose the default MLPT, and move the MLPT abstraction into its own implementation specific location (to logically separate it from other page table implementations) is explained at MoveMLPT and found in the Gelato CVS repository.

Detailed information regarding the virtues of changing page tables can be found at:

An explanation of which architectures may be well suited to changing page tables can be found here:

The Page Table Interface

The page table interface is composed of two parts. The first part is the architecture independent interface. All architecture independent page table operations are performed through this interface. The cost of running through this interface is negligible for the major architectures (i386/IA64/PowerPC). The second part of the interface is for architecture specific page table operations.

The architecture independent interface is contained in the include/linux/ directory.

include/linux/default-pt.h contains the general operations that can be performed on the default page table (MLPT). This includes

The other basic operation that is performed on the page tables is iteration. The open coding (direct accessing and assumption of implementation) of the MLPT has allowed for many tailored iterations within the VM system. These iterators have been individually abstracted for the MLPT and are broken into read, build and dual iterators.

include/mm/mlpt-read-iterators.h contains the read iterators for the MLPT. A read iterator is passed an address range and a function pointer. It visits each PTE that exists within the range, in address order, and calls the callback on it.
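As an illustration of the read iterator idea, here is a minimal user-space sketch. It is not the PTI code: the two-level `toy_pt` table, the `toy_` names and the callback signature are all assumptions made for this example. It only shows the shape of the technique: walk a range, skip anything that is not mapped, and hand each present entry to a callback.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)
#define TOY_PTRS       16              /* entries per level (toy value) */

typedef unsigned long toy_pte_t;       /* 0 means "not present" */

/* Hypothetical two-level table: a directory of optional leaf tables. */
struct toy_pt {
    toy_pte_t *leaf[TOY_PTRS];
};

typedef void (*toy_read_cb)(toy_pte_t *pte, unsigned long addr, void *data);

/* Visit every present PTE in [addr, end) in ascending address order.
 * Returns the number of entries handed to the callback. */
static unsigned long toy_read_iterator(struct toy_pt *pt, unsigned long addr,
                                       unsigned long end, toy_read_cb func,
                                       void *data)
{
    unsigned long visited = 0;

    assert(addr <= end &&
           end <= (unsigned long)TOY_PTRS * TOY_PTRS * TOY_PAGE_SIZE);
    for (; addr < end; addr += TOY_PAGE_SIZE) {
        unsigned long vpn = addr >> TOY_PAGE_SHIFT;
        toy_pte_t *leaf = pt->leaf[vpn / TOY_PTRS];

        if (leaf == NULL || leaf[vpn % TOY_PTRS] == 0)
            continue;                  /* nothing mapped: skip, never build */
        func(&leaf[vpn % TOY_PTRS], addr, data);
        visited++;
    }
    return visited;
}

/* Example callback: count the entries it is shown. */
static void toy_count_cb(toy_pte_t *pte, unsigned long addr, void *data)
{
    (void)pte;
    (void)addr;
    ++*(unsigned long *)data;
}
```

A caller would pass a page-aligned range and a callback such as `toy_count_cb`. The iterator never allocates, which is what distinguishes it from a build iterator.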

include/mm/mlpt-build-iterators.h contains the build iterators for the MLPT. A build iterator is passed an address range and a function pointer. The page table is built for each address in the range, and the callback operates on each PTE.

include/mm/mlpt-dual-iterators.h contains the dual iterators for the MLPT. A dual iterator reads a source page table and builds the destination page table as it iterates through the address range.

include/asm/pgalloc.h contains the memory allocation functions for the MLPT.

A number of additional files have been added to architecture independent linux code to contain abstracted code for the MLPT.

The architecture dependent interface for IA64 is contained in /arch/ia64/mm/mlpt.h

Four macros have been added to the PTI to move locking inside the page table implementation.

typedef struct pt_struct { pmd_t *pmd; } pt_path_t;

These are worker functions for the PTI:
lock_pte(mm, pt_path) - lock the pte pointed to by the previously filled path
unlock_pte(mm, pt_path) - unlock the pte pointed to by the previously filled path
get_pte_lock(mm, pt_path, address) - get the pte from a path (which may be partial) and lock it
atomic_pte_same(mm, pte, orig_pte, pt_path) - check that the pte pointed to by pte has not changed from the original pte (the path is needed to provide atomicity)
Now we no longer have mlpt - we just call it the default page table.
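The locking helpers listed above can be sketched in user space roughly as follows. This is only an illustration under stated assumptions: `toy_pt_path_t` stands in for pt_path_t, the lock is a C11 `atomic_flag` spinlock rather than a kernel spinlock, the mm argument is omitted, and every `toy_` name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long toy_pte_t;

/* Stand-in for a pte page: the entries plus the lock that covers them. */
struct toy_pte_page {
    atomic_flag lock;
    toy_pte_t   pte[16];
};

/* Mirrors the idea of pt_path_t: remember where the walk ended. */
typedef struct toy_pt_path {
    struct toy_pte_page *page;
} toy_pt_path_t;

static void toy_pte_page_init(struct toy_pte_page *page)
{
    size_t i;

    atomic_flag_clear(&page->lock);
    for (i = 0; i < 16; i++)
        page->pte[i] = 0;
}

/* Lock the pte page recorded in a previously filled path. */
static void toy_lock_pte(toy_pt_path_t *path)
{
    assert(path != NULL && path->page != NULL);
    while (atomic_flag_test_and_set_explicit(&path->page->lock,
                                             memory_order_acquire))
        ;                              /* spin until the holder releases */
}

static void toy_unlock_pte(toy_pt_path_t *path)
{
    assert(path != NULL && path->page != NULL);
    atomic_flag_clear_explicit(&path->page->lock, memory_order_release);
}

/* Compare the live pte with a saved copy while holding the lock, so the
 * comparison cannot race with an update made under the same lock. */
static bool toy_atomic_pte_same(toy_pt_path_t *path, const toy_pte_t *pte,
                                toy_pte_t orig_pte)
{
    bool same;

    toy_lock_pte(path);
    same = (*pte == orig_pte);
    toy_unlock_pte(path);
    return same;
}
```

The point of passing the path is that the lock covering a pte is found from where the walk ended, so callers do not need to know how a given page table implementation places its locks.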

NOTE - to self. Have a quick chat about fastcall on get_locked_pte with Adam

Page table initialisation

static inline int create_user_page_table(struct mm_struct *mm)

This function creates and initialises a page table. It is called in fork.c to initialise the page table for a newly forked user process. For the MLPT implementation, an initialised page table comprises a zeroed out pgd directory ONLY. It returns 1 on success and 0 on failure (out of memory). TESTED.

static inline void create_kernel_page_table(void)

This function 'creates' and initialises the kernel page table. It compiles out for the MLPT (a zeroed out pgd directory is provided hard coded in assembler for each architecture - known as the swapper_pg_dir). Other page table implementations that use alternative memory management for building page tables are non trivial and won't use swapper_pg_dir. NOTE: this function is currently in the architecture dependent interface and should be moved.

Page table destruction

static inline void destroy_user_page_table(struct mm_struct *mm)

This function destroys the page table. It is called in fork.c to free the page table for a user process that is being destroyed. For the MLPT implementation, freeing the page table involves returning the allocated pgd directory to the kernel's quicklists. The pud, pmd, and pte directories have been returned to the kernel via free_pgtables prior to this function being called. TESTED.
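The create/destroy pair described above can be sketched in user space as follows. This is a simplified model, not the kernel code: `struct toy_mm`, `TOY_PTRS_PER_PGD` and the `toy_` functions are hypothetical, and `calloc`/`free` stand in for the kernel's pgd allocation and quicklists.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_PTRS_PER_PGD 512           /* toy value, not an arch constant */

typedef unsigned long toy_pgd_t;

struct toy_mm {
    toy_pgd_t *pgd;                    /* top level directory only */
};

/* Allocate a zeroed top level directory.
 * Returns 1 on success and 0 on failure, as in the description above. */
static int toy_create_user_page_table(struct toy_mm *mm)
{
    assert(mm != NULL);
    mm->pgd = calloc(TOY_PTRS_PER_PGD, sizeof(toy_pgd_t));
    return mm->pgd != NULL;
}

/* Release the top level directory. The lower levels are assumed to have
 * been torn down already, as the text says free_pgtables does. */
static void toy_destroy_user_page_table(struct toy_mm *mm)
{
    assert(mm != NULL);
    free(mm->pgd);
    mm->pgd = NULL;
}
```

A different page table implementation would put its own root structure behind the same two calls, which is why callers such as fork.c only see the success/failure result.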

Look up a page table

static inline pte_t *lookup_page_table(struct mm_struct *mm, unsigned long address, 
        spinlock_t **ptl)

This function looks up a page table (user or kernel) for a mapping. To look up the kernel page table, the address of the init process must be passed. The address of the spinlock that covers the pte can be obtained if the lock is to be taken; call with NULL if the lock is not to be taken. It is the responsibility of the caller to take out the spinlock.
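The optional lock out-parameter can be illustrated with a small user-space sketch. The table layout, `toy_spinlock_t` and all `toy_` names are assumptions made for this example; only the calling convention (pass NULL when the lock is not wanted, and the caller takes the lock) follows the description above.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PTRS       16

typedef unsigned long toy_pte_t;
typedef struct { int held; } toy_spinlock_t;   /* placeholder lock type */

/* A leaf table together with the lock that covers its entries. */
struct toy_leaf {
    toy_spinlock_t lock;
    toy_pte_t      pte[TOY_PTRS];
};

struct toy_mm {
    struct toy_leaf *leaf[TOY_PTRS];
};

/* Look up the pte slot for an address without allocating anything.
 * If ptl is non-NULL it receives the address of the covering lock;
 * taking that lock is left to the caller. */
static toy_pte_t *toy_lookup_page_table(struct toy_mm *mm,
                                        unsigned long address,
                                        toy_spinlock_t **ptl)
{
    unsigned long vpn = address >> TOY_PAGE_SHIFT;
    struct toy_leaf *leaf;

    assert(mm != NULL);
    if (vpn >= (unsigned long)TOY_PTRS * TOY_PTRS)
        return NULL;                   /* outside the toy address space */
    leaf = mm->leaf[vpn / TOY_PTRS];
    if (leaf == NULL)
        return NULL;                   /* no table built here */
    if (ptl != NULL)
        *ptl = &leaf->lock;
    return &leaf->pte[vpn % TOY_PTRS];
}
```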

static inline pte_t *lookup_gate_area(struct mm_struct *mm, unsigned long address)

This function looks up the gate area of a page table. The location of the gate area varies with architecture.

Build a page table

  • PAUL: get rid of the spinlock in interface - logically we need a better solution (same for lookup page table).

static inline pte_t *build_page_table(struct mm_struct *mm, unsigned long address,
        spinlock_t **ptl)

This function builds a page table readying it for insertion. For the MLPT implementation, if pud/pmd/pte directories are required to add a particular user page table entry, then they are allocated, and pointers set accordingly. A spinlock will be taken out covering the pte if a variable to hold the spinlock address is provided (call with NULL and no spinlock will be taken out).
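Building on demand can be sketched as follows. This is a user-space toy with two levels instead of pgd/pud/pmd/pte, and it leaves out the optional spinlock argument; `struct toy_mm`, the `nr_leaves` counter and the function name are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PTRS       16

typedef unsigned long toy_pte_t;

struct toy_mm {
    toy_pte_t    *leaf[TOY_PTRS];
    unsigned long nr_leaves;           /* leaf tables allocated so far */
};

/* Return the pte slot for an address, allocating the leaf table that
 * holds it if it does not exist yet. Returns NULL if the address is out
 * of range or memory runs out. */
static toy_pte_t *toy_build_page_table(struct toy_mm *mm,
                                       unsigned long address)
{
    unsigned long vpn = address >> TOY_PAGE_SHIFT;
    unsigned long dir = vpn / TOY_PTRS;

    assert(mm != NULL);
    if (vpn >= (unsigned long)TOY_PTRS * TOY_PTRS)
        return NULL;
    if (mm->leaf[dir] == NULL) {
        toy_pte_t *leaf = calloc(TOY_PTRS, sizeof(*leaf));

        if (leaf == NULL)
            return NULL;
        mm->leaf[dir] = leaf;          /* set the pointer in the directory */
        mm->nr_leaves++;
    }
    return &mm->leaf[dir][vpn % TOY_PTRS];
}
```

Calling it twice for the same address returns the same slot and allocates nothing the second time, which is the property a caller relies on before inserting an entry.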

Tear down a page table

[Paul] - add info from Gorman explaining when this function is called.

The function free_pgtables tears down a page table between a range of addresses (floor to ceiling). The process address space is broken up into a list of linear regions (vmas) and free_pgtables traverses through this list of vmas, calling tear_down_pgtable_range on the relevant vma regions.

static inline void tear_down_pgtable_range(struct mmu_gather **tlb,
                        unsigned long addr, unsigned long end,
                        unsigned long floor, unsigned long ceiling)

addr and end represent the vma range to be torn down. The function coalesce_vmas is called in free_pgtables prior to calling tear_down_pgtable_range. This function creates the illusion of joining a number of vmas into one vma prior to calling tear_down_pgtable_range (done for optimisation purposes for the MLPT implementation).

static inline void coallesce_vmas(struct vm_area_struct **vma_p,
                struct vm_area_struct **next_p)

The page table is unused (but possibly previously built) in the address range to be torn down. For the MLPT, tearing down the page table means deallocating relevant pte/pud/pmd directories in the address range.
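One way to picture the role of floor and ceiling is the sketch below. It assumes, and this is an assumption of the example rather than something stated on this page, that a directory may only be freed when the whole span it covers lies between floor and ceiling, so that a directory still shared with a neighbouring region is left alone. The two-level table and all `toy_` names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096UL
#define TOY_PTRS      16UL
#define TOY_LEAF_SPAN (TOY_PTRS * TOY_PAGE_SIZE)  /* bytes covered by a leaf */

typedef unsigned long toy_pte_t;

struct toy_pt {
    toy_pte_t *leaf[TOY_PTRS];
};

/* Free the leaf tables touched by [addr, end), but only those whose
 * whole span lies inside [floor, ceiling). Returns how many were freed. */
static unsigned long toy_tear_down_range(struct toy_pt *pt,
                                         unsigned long addr,
                                         unsigned long end,
                                         unsigned long floor,
                                         unsigned long ceiling)
{
    unsigned long freed = 0;
    unsigned long i;

    assert(floor <= addr && addr <= end && end <= ceiling);
    for (i = addr / TOY_LEAF_SPAN;
         i < TOY_PTRS && i * TOY_LEAF_SPAN < end; i++) {
        unsigned long start = i * TOY_LEAF_SPAN;

        if (start < floor || start + TOY_LEAF_SPAN > ceiling)
            continue;                  /* may still serve a neighbour */
        if (pt->leaf[i] != NULL) {
            free(pt->leaf[i]);
            pt->leaf[i] = NULL;
            freed++;
        }
    }
    return freed;
}
```

With a wider floor and ceiling the same call frees more directories, which gives an intuition for why joining adjacent vmas before the call can pay off.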

Dual Iterators

There are two customised dual iterators, contained in include/mm/mlpt-dual-iterators.h. A dual iterator builds a destination page table whilst iterating over a source page table. Two iterator implementations occur naturally for the 'same conceptual task' because of the different locking requirements.

static inline int copy_page_range_iterator(struct mm_struct *dst_mm, 
    struct mm_struct *src_mm, unsigned long addr, unsigned long end, 
    struct vm_area_struct *vma, pte_rw_iterator_callback_t func)

The copy page range iterator is called during fork and mmap. The source address space is contiguously duplicated to the destination address space for keys in the given range. The callback operates on each of the source and destination entries as it iterates. TESTED
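A dual iterator of this kind can be sketched in user space as below. The sketch reads a source table, builds the destination only where the source has an entry, and lets a callback decide what is written. The table layout, the callback type and every `toy_` name are assumptions made for the example, and no locking is modelled.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)
#define TOY_PTRS       16

typedef unsigned long toy_pte_t;       /* 0 means "not present" */

struct toy_pt {
    toy_pte_t *leaf[TOY_PTRS];
};

typedef void (*toy_copy_cb)(toy_pte_t *dst, const toy_pte_t *src,
                            unsigned long addr);

/* Walk [addr, end) in the source; for each present entry make sure the
 * destination has a slot, then let the callback fill it.
 * Returns 0 on success and -1 if an allocation fails. */
static int toy_copy_range_iterator(struct toy_pt *dst,
                                   const struct toy_pt *src,
                                   unsigned long addr, unsigned long end,
                                   toy_copy_cb func)
{
    assert(addr <= end &&
           end <= (unsigned long)TOY_PTRS * TOY_PTRS * TOY_PAGE_SIZE);
    for (; addr < end; addr += TOY_PAGE_SIZE) {
        unsigned long vpn = addr >> TOY_PAGE_SHIFT;
        unsigned long d = vpn / TOY_PTRS, i = vpn % TOY_PTRS;
        const toy_pte_t *sleaf = src->leaf[d];

        if (sleaf == NULL || sleaf[i] == 0)
            continue;                  /* read side: skip holes */
        if (dst->leaf[d] == NULL) {    /* build side: allocate on demand */
            dst->leaf[d] = calloc(TOY_PTRS, sizeof(toy_pte_t));
            if (dst->leaf[d] == NULL)
                return -1;
        }
        func(&dst->leaf[d][i], &sleaf[i], addr);
    }
    return 0;
}

/* Example callback: plain copy of the entry. */
static void toy_copy_one(toy_pte_t *dst, const toy_pte_t *src,
                         unsigned long addr)
{
    (void)addr;
    *dst = *src;
}
```

Because the destination is only built under present source entries, holes in the source stay holes in the copy.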

static inline unsigned long move_page_tables(struct vm_area_struct *vma,
    unsigned long old_addr, struct vm_area_struct *new_vma,
    unsigned long new_addr, unsigned long len, mremap_callback_t func)

This iterator reads an address space and builds a section of the SAME address space (hence the different locking requirements). It is used by the mremap system call for expanding/shrinking memory mappings. TESTED

Read Iterators

There are seven customised read iterators, contained in include/mm/mlpt-read-iterators.h. A read iterator visits each entry that exists in the given address range and applies the function passed to it to that entry.

static inline unsigned long unmap_page_range_iterator(struct mmu_gather *tlb,
        struct vm_area_struct *vma, unsigned long addr, unsigned long end,
        long *zap_work, struct zap_details *details, zap_pte_callback_t func)

The unmap iterator unmaps all page table entries in the given range and flushes the TLB. The page table remains built within this range (exactly as before) except that all ptes in the range are now NULL (not present). The page table itself can now be torn down.
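The difference between unmapping and tearing down can be shown with a small user-space sketch. It clears the entries in a range but leaves the table that holds them allocated. The table layout, the flush counter standing in for a TLB flush, and all `toy_` names are hypothetical, and the callback argument of the real iterator is left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)
#define TOY_PTRS       16

typedef unsigned long toy_pte_t;       /* 0 means "not present" */

struct toy_pt {
    toy_pte_t *leaf[TOY_PTRS];
};

/* Clear every present entry in [addr, end). Leaf tables are left in
 * place. *tlb_flushes is bumped once if anything was cleared, standing
 * in for a single batched TLB flush. Returns the entries cleared. */
static unsigned long toy_unmap_range_iterator(struct toy_pt *pt,
                                              unsigned long addr,
                                              unsigned long end,
                                              unsigned long *tlb_flushes)
{
    unsigned long zapped = 0;

    assert(pt != NULL && tlb_flushes != NULL && addr <= end &&
           end <= (unsigned long)TOY_PTRS * TOY_PTRS * TOY_PAGE_SIZE);
    for (; addr < end; addr += TOY_PAGE_SIZE) {
        unsigned long vpn = addr >> TOY_PAGE_SHIFT;
        toy_pte_t *leaf = pt->leaf[vpn / TOY_PTRS];

        if (leaf == NULL || leaf[vpn % TOY_PTRS] == 0)
            continue;
        leaf[vpn % TOY_PTRS] = 0;      /* entry gone, table still built */
        zapped++;
    }
    if (zapped != 0)
        ++*tlb_flushes;
    return zapped;
}
```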

static inline unsigned long msync_read_iterator(struct vm_area_struct *vma,
        unsigned long addr, unsigned long end, msync_callback_t func)

msync flushes changes made to the in-core copy of a file that was mapped into memory using mmap(2) back to disk within this range.

Build Iterators


Architecture Independent Interface

The following benchmarks examine the degradation in performance of the Linux kernel on the major architectures as a result of accessing the Linux page tables through the architecture independent interface. It is critical that the page table interface minimizes degradation in performance of the Linux VM system for those architectures that will never change page tables away from the MLPT.

IA64 - running a 3 level MLPT with a standard 16K page. RAM 4G - 5 runs. Benchmark: LMbench 2.03

Processor, Processes - times in microseconds - smaller is better
                                 null     null                       open    signal   signal    fork    execve  /bin/sh
kernel                           call      I/O     stat    fstat    close   install   handle  process  process  process
-----------------------------  -------  -------  -------  -------  -------  -------  -------  -------  -------  -------
2.6.17-rc3-vanilla               0.272  0.45354    2.439    0.545    4.784    0.550    2.810    113.5    645.1   3517.9
  s.d. (5 runs)                  0.000  0.00124    0.006    0.000    0.012    0.000    0.025      0.0      6.7     15.0
2.6.17-rc3-PTI                   0.272  0.45044    2.383    0.580    4.813    0.555    2.865    118.0    667.2   3592.8
  s.d. (5 runs)                  0.000  6.51920    0.012    0.006    0.036    0.000    0.009      0.0      6.4     12.6

File create/delete and VM system latencies in microseconds - smaller is better
                          0K       0K       1K       1K       4K       4K      10K      10K     Mmap     Prot    Page
kernel                  Create   Delete   Create   Delete   Create   Delete   Create   Delete   Latency  Fault   Fault
----------------------- -------  -------  -------  -------  -------  -------  -------  -------  -------  ------  ------
2.6.17-rc3-vanilla        47.69    20.80    72.55    36.24    75.58    36.41    99.08    39.34   4183.2   1.327    1.00
  s.d.                     0.05     0.03     0.36     0.16     3.61     0.06     3.60     0.07     27.2   0.058    0.00
2.6.17-rc3-PTI            47.61    20.57    72.43    36.17    75.51    36.11    99.18    39.13   4413.0   1.308    1.00
  s.d.                     0.02     0.04     0.29     0.22     3.66     0.22     3.13     0.11     33.2   0.035    0.00

Summary: Fork 4.0% deterioration, execve 3.5% deterioration, mmap 5.5% deterioration.
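The summary figures can be rechecked from the table rows with a few lines of C. The helper below is a hypothetical name introduced for this check; it computes the relative slowdown of the PTI kernel against the vanilla kernel as a percentage. With the values in the tables, fork (113.5 to 118.0) comes out at about 4.0%, execve (645.1 to 667.2) at about 3.4%, and mmap latency (4183.2 to 4413.0) at about 5.5%.

```c
#include <assert.h>

/* Relative deterioration of pti against vanilla, in percent.
 * Both arguments are latencies, so larger means slower. */
static double toy_deterioration_pct(double vanilla, double pti)
{
    assert(vanilla > 0.0);
    return (pti - vanilla) / vanilla * 100.0;
}
```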

i386 - running a 2 level MLPT with a 4K page.

386 results here.

Where I am losing my performance at the moment

The page fault handler is a very hot code path, sensitive to minor code changes and depends heavily on the organization of data structures. Cache line bouncing has a critical influence on page fault performance in SMP systems and becomes particularly significant for large applications (like huge databases or computational applications) that try to minimize startup time by having multiple threads of a process running on different processors in order to initialize their memory structures concurrently.

We need to rework the page fault handler abstraction. Unfortunately we are going to have to look at passing around a struct to get back performance.

General scribble

Zoltan Menyhart > What do you mean with "physical mode"?

"Not using any TLB entry (or any HW supported address translation stuff) to translate the data addresses before they go out of the CPU."

"Walking the page tables in physical mode is insensitive to any TLB purges, therefore these purges do not make sure that there is no other CPU just in the middle of page table walking."

Zoltan's problem: "There is a possibility that walking has already been started, but it has not been completed yet, when "free_pgtables()" runs."

Suggestions by Adam

This section is just to log ideas put forward by Adam

  • The gate page lookup is messy. The gate page is used for fast system calls. Consider copying the gate page to the relevant page table and then using the one lookup function.
  • Investigate what happens with inlining versus not inlining. See how it affects the benchmarks (see above about losing performance).

Testing issues

The PTI should not change the operation of the kernel (the abstracted kernel should be functionally equivalent to the original). Code is merely abstracted.

Testing the PTI begins with making the abstracted code execute. A good deal of the abstracted code is called regularly, so a bad abstraction causes the kernel to crash immediately. However, some code requires specific knowledge merely to get it to run. The following is a list of problems encountered during testing and their solutions.

  • The swapon and swapoff system calls: There exists an iterator in swapfile.c that is called due to the swapoff system call. The following message "Unable to find swap-space signature" was encountered despite swap being enabled in the config. The swap space was consequently not enabled. For some reason the mkswap was not being run prior to swapon. I changed configs to fix the problem.
  • When testing the PTI with LTP it is best to minimise the system memory at boot, so that the system tests run under memory pressure without having to hand tweak the tests. Of course hand tweaking is inevitable, but it's a good place to start.

## elilo configuration file generated by elilo 3
## limit the memory with append="mem=512K"

This avoids having to type it at the boot prompt

  • The automounter: Another config problem. The automounter does not shut down /home cleanly (compiling pristine kernel) on a reboot. Find out why. I can start and stop the automounter using /etc/init.d/autofs stop and start.

Rolling forward to 2.6.17-rc6

  • The asm-generic/pgtable.h has changed.

Reworking the Iterators in the PTI and other stuff

  • The issues:
    • Function pointers. At the moment they serve only to slow the PTI down. Function pointers are used when we don't know until run time which function we are going to need. In the current PTI we know which function to use, so the function pointers are not necessary.
    • I can use function pointers to simplify the interface for non performance critical iterators only (iterator reuse).
    • Why are function pointers so slow? Obviously, we have to jump to the relevant function being passed (theoretically we can have the function inlined at compile time if we know what is to be called). Are there more reasons? NB Paul: Talk to ADAM about this.
    • I will use them to simplify the interface for non performance critical iterators only.
    • Think about consequences of moving implementation back to C files. Which parts of the implementation can be moved back to C files?
    • Get the page table lock out of mm_struct.
      • Leave this until we start reworking the locking, it is low priority. Also what does it have to do with reworking the PTI iterators? - AdamWiggins

    • ABI is the application binary interface and GP is the global pointer.
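The function pointer question raised in the list above can be made concrete with a small sketch. Both iterators below do the same work; the first takes a callback, the second calls a known `static inline` helper directly, which a compiler is normally free to inline. Whether that accounts for the measured slowdown is not established here; the example only shows the two shapes. `TOY_ACCESSED` and all `toy_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ACCESSED 0x20UL            /* hypothetical "accessed" bit */

typedef unsigned long toy_pte_t;       /* 0 means "not present" */
typedef void (*toy_pte_cb)(toy_pte_t *pte);

static inline void toy_mark_old(toy_pte_t *pte)
{
    *pte &= ~TOY_ACCESSED;
}

/* Generic form: one loop body reused for any callback, at the price of
 * an indirect call per present entry. */
static void toy_iterate_cb(toy_pte_t *ptes, size_t n, toy_pte_cb func)
{
    size_t i;

    assert(ptes != NULL && func != NULL);
    for (i = 0; i < n; i++)
        if (ptes[i] != 0)
            func(&ptes[i]);
}

/* Specialised form: the callee is known at compile time, so the call
 * can be inlined into the loop. */
static void toy_iterate_mark_old(toy_pte_t *ptes, size_t n)
{
    size_t i;

    assert(ptes != NULL);
    for (i = 0; i < n; i++)
        if (ptes[i] != 0)
            toy_mark_old(&ptes[i]);
}
```

The generic form suits iterators that are not performance critical, where reuse matters more than the call overhead. The specialised form trades some duplication for a loop the compiler can optimise as a whole.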

Tutti howto

ssh to crashme, then run console tutti to access tutti. tutti does not use ldap, so ssh to tutti needs a different password. After console tutti you will come up at a boot prompt. To get into the management interface type CTRL T, then type * rst to reboot.

PTI - ia64 Patches

  • NB: To be applied in the following order:
    • PTI arch independent
    • PTI arch-ia64
    • Enable more than just the default-pt patch series
    • GPT patch series / LVHPT

The non hack version.

  • Introduce include/asm-ia64/pt.h
  • move guts of arch/asm-ia64/mm/ia64-default-pt.h to include/asm-ia64/pt-default.h
  • Added arch/asm-ia64/mm/pt-default.c

NB: (notes to myself to fix)

  • Another lookup has appeared in discontig.c - add this.
  • pgalloc called in asm-generic/tlb.h - abstract to pt-tlb.h
  • missed a pgalloc in bin-elf something or other.
  • fix up page.h and pgtable.h and we are done with IA64 patches.
    • Added include/asm-ia64/pt-types.h for page table types.
  • go back and add the dynamic create_kernel_page_table needed for GPT.
  • Forgot the new abstraction in discontig.c
  • Still have to fix up the pg_nopud stuff.
  • Missed an abstraction in try_to_unmap_cluster
  • drm_follow_page (Direct Rendering Manager). We have a driver directly accessing the kernel page table here.

Syncing with GPT

  • Planned changes to PTI to improve for GPT.
    • Lose the macros in ivt.h and put the lookup back into ivt.S
    • Revisit some of the file naming. eg. mm-pt.h. I will make it fit with the way Adam is working it for current GPT
    • Put the memory allocation of pages back in. Adam is planning on reusing the quicklist mechanism.
    • Must figure out why lookup isn't working on tartufi.

Since Adam wants the quicklist allocator framework kept -> we will have to abstract pgalloc slightly differently.

* Lose pt-pgtable.h

  • include pgtable-gpt.h or pgtable-default.h

* I prefer the way I got Adam to do it originally.

Release for LCA 2.6.19-rc3

There are three patch series to be applied, PTI, LVHPT and GPT. They will be applied in that order.

PTI patch series

  • Arch independent interface
    • Move MLPT, minus iterators, into its own file.
    • Abstract page fault handlers.

Patch 1 Shifting mlpt allocation functions from memory.c to pt-default.c

Patch 2 Clean up for include/asm-generic/pgtable.h

  • Shift PTI for LCA to its own page.
    • Document the config file.

IA64wiki: PageTableInterface (last edited 2009-12-10 03:14:02 by localhost)
