Aim

This page is currently out of date - fix up after LCA

The page tables in Linux are implemented as an open-coded multi-level page table (MLPT). The purpose of the page table interface (PTI) project is to define an implementation-independent interface for operating on the page tables. This will enable experimentation with different page table structures better suited to the 64-bit address space. The MLPT will remain the DEFAULT page table in Linux and will be referred to as the default page table.

The page table interface has been developed by abstracting the current page table implementation (the default MLPT) behind an independent interface. The interface is explained below from the perspective of the default MLPT.

Patches moving the MLPT behind the new interface can be found at the Gelato CVS repository for the 2.6.17.3 kernel at:

http://www.gelato.unsw.edu.au/IA64wiki/cvs/cvs/kernel/page_table_interface (Link up soon)

For further information please contact PaulDavies at Gelato@UNSW.

INSTRUCTIONS
1) Unpack the 2.6.17.3 kernel.
2) Apply the pti patch series to push the architecture independent interface.
3) Apply the patch pti-ia64-2.6.17-rc4.patch to push the IA64 specific part of the page table interface.
4) Compile for an IA64 and run the kernel on an IA64.

The patched kernel will have the same functionality as the unpatched kernel with almost no deterioration in performance (refer to benchmarks). The kernel is then ready for different page table implementations to be plugged into its new page table interface.

A patch to choose the default MLPT, and move the MLPT abstraction into its own implementation specific location (to logically separate it from other page table implementations) is explained at MoveMLPT and found in the Gelato CVS repository.

Detailed information regarding the virtues of changing page tables can be found at:

http://www.gelato.unsw.edu.au/IA64wiki/VariableRadixPageTables

An explanation of which architectures may be well suited to changing page tables can be found here:

The Page Table Interface

The page table interface is composed of two parts. The first part is the architecture independent interface. All architecture independent page table operations are performed through this interface. The cost of running through this interface is negligible for the major architectures (i386/IA64/PowerPC). The second part of the interface is for architecture specific page table operations.

The architecture independent interface is contained in the include/linux/ directory.

include/linux/default-pt.h contains the general operations that can be performed on the default page table (MLPT). This includes

The other basic operation that is performed on the page tables is iteration. The open coding (direct accessing and assumption of implementation) of the MLPT has allowed for many tailored iterations within the VM system. These iterators have been individually abstracted for the MLPT and are broken into read, build and dual iterators.

include/mm/mlpt-read-iterators.h contains the read iterators for the MLPT. Read iterators are passed an address range and a function pointer. Each PTE that exists within this range is visited in address order and the callback operates on the PTE accordingly.

include/mm/mlpt-build-iterators.h contains the build iterators for the MLPT. Build iterators are passed an address range and a function pointer. The page table is built for each address in the range. The callback operates on each PTE accordingly.
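As a rough illustration of the build iterator idea, here is a minimal userspace sketch. It is not the code from the patch series: the two-level toy table, the 4K page size, and every toy_ name are assumptions made for the example. The iterator walks a page-aligned range, allocates missing directories on the way, and hands each PTE slot to a callback.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_PAGE_SHIFT 12u                      /* 4K pages (assumption) */
#define TOY_PAGE_SIZE  (1ul << TOY_PAGE_SHIFT)
#define TOY_BITS       4u                       /* 16 entries per directory */
#define TOY_ENTRIES    (1u << TOY_BITS)

typedef uint64_t toy_pte_t;                     /* 0 means "not present" */
struct toy_pte_dir { toy_pte_t pte[TOY_ENTRIES]; };
struct toy_pgd     { struct toy_pte_dir *dir[TOY_ENTRIES]; };
struct toy_mm      { struct toy_pgd *pgd; };    /* stand-in for mm_struct */

typedef int (*toy_build_callback_t)(toy_pte_t *pte, unsigned long addr,
                                    void *data);

/* Build the path to one pte, allocating the pte directory on demand. */
static inline toy_pte_t *toy_build_pte(struct toy_mm *mm, unsigned long addr)
{
    struct toy_pte_dir **slot =
        &mm->pgd->dir[(addr >> (TOY_PAGE_SHIFT + TOY_BITS)) & (TOY_ENTRIES - 1)];

    if (*slot == NULL)
        *slot = calloc(1, sizeof(**slot));
    if (*slot == NULL)
        return NULL;                            /* out of memory */
    return &(*slot)->pte[(addr >> TOY_PAGE_SHIFT) & (TOY_ENTRIES - 1)];
}

/* Visit every page in [addr, end), building the table as it goes. */
static inline int toy_build_iterator(struct toy_mm *mm, unsigned long addr,
        unsigned long end, toy_build_callback_t func, void *data)
{
    assert((addr & (TOY_PAGE_SIZE - 1)) == 0);  /* page-aligned start */
    for (; addr < end; addr += TOY_PAGE_SIZE) {
        toy_pte_t *pte = toy_build_pte(mm, addr);
        int err;

        if (pte == NULL)
            return -1;
        err = func(pte, addr, data);            /* callback sees every slot */
        if (err != 0)
            return err;
    }
    return 0;
}

/* Example callback: install an identity mapping and count the visit. */
static inline int toy_map_identity(toy_pte_t *pte, unsigned long addr,
                                   void *data)
{
    *pte = (toy_pte_t)addr | 1u;                /* bit 0 marks "present" */
    (*(int *)data)++;
    return 0;
}
```

Unlike a read iterator, the callback here is invoked for every page in the range, because the build step guarantees that a slot exists.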

include/mm/mlpt-dual-iterators.h contains the dual iterators for the MLPT. A dual iterator reads a source page table and builds the destination page table as it iterates through the address range.

include/asm/pgalloc.h contains the memory allocation functions for the MLPT.

A number of additional files have been added to architecture independent linux code to contain abstracted code for the MLPT.

The architecture dependent interface for IA64 is contained in /arch/ia64/mm/mlpt.h

Added four macros to PTI to shift locking inside the page table implementation

typedef struct pt_struct { pmd_t *pmd; } pt_path_t; /* Partial path */

These are worker functions for the PTI.
lock_pte(mm, pt_path) - lock the pte pointed to by the previously filled path
unlock_pte(mm, pt_path) - unlock the pte pointed to by the previously filled path
get_pte_lock(mm, pt_path, address) - get the pte from a (possibly partial) path and lock it
atomic_pte_same(mm, pte, orig_pte, pt_path) - check that the pte pointed to by pte still matches the original pte (the path is needed to provide atomicity)
SHIFT THESE. Now we no longer have mlpt - we just call it the default page table.

NOTE - to self. Have a quick chat about fastcall on get_locked_pte with Adam

Page table initialisation

static inline int create_user_page_table(struct mm_struct *mm)

This function creates and initialises a page table. It is called in fork.c to initialise the page table for a newly forked user process. For the MLPT implementation, an initialised page table comprises a zeroed out pgd directory ONLY. It returns 1 on success and 0 on failure (out of memory). TESTED.
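The behaviour described above can be modelled in a few lines of userspace C. This is only a sketch under assumptions: struct toy_mm stands in for mm_struct, the directory size is arbitrary, and calloc replaces the kernel allocator. It keeps the documented convention of returning 1 on success and 0 on failure.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_ENTRIES 16              /* toy directory size, not a kernel value */

struct toy_pgd { void *entry[TOY_ENTRIES]; };   /* top-level directory */
struct toy_mm  { struct toy_pgd *pgd; };        /* stand-in for mm_struct */

/* Returns 1 on success and 0 on failure (out of memory). */
static inline int toy_create_user_page_table(struct toy_mm *mm)
{
    assert(mm != NULL);
    mm->pgd = calloc(1, sizeof(*mm->pgd));  /* zeroed top-level directory only */
    return mm->pgd != NULL;
}

/* Counts non-empty top-level slots; a fresh table should have none. */
static inline int toy_pgd_used_slots(const struct toy_mm *mm)
{
    int used = 0;

    for (int i = 0; i < TOY_ENTRIES; i++)
        used += (mm->pgd->entry[i] != NULL);
    return used;
}
```

No lower-level directories are allocated at creation time; in this model they would only appear when a mapping is built.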

static inline void create_kernel_page_table(void)

This function 'creates' and initialises the kernel page table. This compiles out for the MLPT (a zeroed out pgd directory is provided hard coded in assembler for each architecture - known as the swapper_pg_dir). Other page table implementations that use alternative memory management for building page tables are non trivial and won't use swapper_pg_dir. This guy is in arch dep interface - move

Page table destruction

static inline void destroy_user_page_table(struct mm_struct *mm)

This function destroys the page table. It is called in fork.c to free the page table for a user process that is being destroyed. For the MLPT implementation, freeing the page table involves returning the allocated pgd directory to the kernel's quicklists. The pud, pmd, and pte directories have been returned to the kernel via free_pgtables prior to this function being called. TESTED.
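To make the quicklist step concrete, here is a small userspace model. It is a sketch only, not the kernel's quicklist code: the toy_quicklist structure, the directory size, and all toy_ names are assumptions. A freed directory is pushed onto a LIFO list (the link is stored in its first word) so that the next allocation can reuse it without going back to the general allocator.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_PGD_BYTES 128           /* toy directory size (assumption) */

struct toy_mm { void *pgd; };       /* stand-in for mm_struct */

/* Hypothetical quicklist: a LIFO of free, zeroed directories. */
struct toy_quicklist { void *head; int count; };

static inline void *toy_quicklist_alloc(struct toy_quicklist *q)
{
    void *page = q->head;

    if (page != NULL) {                 /* reuse a cached directory */
        q->head = *(void **)page;       /* pop: head = page->next */
        *(void **)page = NULL;          /* clear the link so it is all zero */
        q->count--;
        return page;
    }
    return calloc(1, TOY_PGD_BYTES);    /* list empty: fall back to the heap */
}

static inline void toy_quicklist_free(struct toy_quicklist *q, void *page)
{
    *(void **)page = q->head;           /* push: page->next = head */
    q->head = page;
    q->count++;
}

static inline void toy_destroy_user_page_table(struct toy_quicklist *q,
                                               struct toy_mm *mm)
{
    /* Lower-level directories are assumed to have been freed already. */
    toy_quicklist_free(q, mm->pgd);
    mm->pgd = NULL;
}
```

The model assumes the directory is already fully zeroed when it is handed back, which is why only the link word needs clearing on reuse.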

Look up a page table

static inline pte_t *lookup_page_table(struct mm_struct *mm, unsigned long address, 
        spinlock_t **ptl)

This function looks up a page table (user or kernel) for a mapping. To look up the kernel page table, the address of the init process must be passed. The address of the spinlock that covers the pte can be obtained if the lock is to be taken; call with NULL if the lock is not to be taken. It is the responsibility of the caller to take out the spinlock.
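A minimal sketch of this lookup contract, using a two-level toy table in userspace. The types, sizes, and toy_ names are assumptions for illustration and struct toy_lock merely stands in for spinlock_t. The lookup never allocates, and it only reports which lock covers the pte; taking it is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12u                        /* 4K pages (assumption) */
#define TOY_BITS       4u                         /* 16 entries per directory */
#define TOY_ENTRIES    (1u << TOY_BITS)

typedef uint64_t toy_pte_t;                       /* 0 means "not present" */
struct toy_lock    { int held; };                 /* stand-in for spinlock_t */
struct toy_pte_dir { toy_pte_t pte[TOY_ENTRIES]; struct toy_lock lock; };
struct toy_pgd     { struct toy_pte_dir *dir[TOY_ENTRIES]; };
struct toy_mm      { struct toy_pgd *pgd; };      /* stand-in for mm_struct */

static inline toy_pte_t *toy_lookup_page_table(struct toy_mm *mm,
        unsigned long address, struct toy_lock **ptl)
{
    unsigned long hi = (address >> (TOY_PAGE_SHIFT + TOY_BITS)) & (TOY_ENTRIES - 1);
    unsigned long lo = (address >> TOY_PAGE_SHIFT) & (TOY_ENTRIES - 1);
    struct toy_pte_dir *dir = mm->pgd->dir[hi];

    if (dir == NULL)
        return NULL;             /* no directory: lookup never allocates */
    if (ptl != NULL)
        *ptl = &dir->lock;       /* report the covering lock, do not take it */
    return &dir->pte[lo];        /* slot may still hold a not-present pte */
}
```

Passing NULL for ptl gives a lock-free lookup, which mirrors the calling convention described above.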

static inline pte_t *lookup_gate_area(struct mm_struct *mm, unsigned long address)

This function looks up the gate area of a page table. The location of the gate area varies with architecture.

Build a page table

static inline pte_t *build_page_table(struct mm_struct *mm, unsigned long address,
        spinlock_t **ptl)

This function builds a page table readying it for insertion. For the MLPT implementation, if pud/pmd/pte directories are required to add a particular user page table entry, then they are allocated, and pointers set accordingly. A spinlock will be taken out covering the pte if a variable to hold the spinlock address is provided (call with NULL and no spinlock will be taken out).
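The build step differs from lookup in two ways: missing directories are allocated, and the lock is actually taken when the caller supplies somewhere to store its address. The following userspace sketch shows both, under assumptions: a two-level toy table, calloc in place of the kernel allocator, and a flag in struct toy_lock in place of a real spinlock. All toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_PAGE_SHIFT 12u                        /* 4K pages (assumption) */
#define TOY_BITS       4u                         /* 16 entries per directory */
#define TOY_ENTRIES    (1u << TOY_BITS)

typedef uint64_t toy_pte_t;                       /* 0 means "not present" */
struct toy_lock    { int held; };                 /* stand-in for spinlock_t */
struct toy_pte_dir { toy_pte_t pte[TOY_ENTRIES]; struct toy_lock lock; };
struct toy_pgd     { struct toy_pte_dir *dir[TOY_ENTRIES]; };
struct toy_mm      { struct toy_pgd *pgd; };      /* stand-in for mm_struct */

static inline toy_pte_t *toy_build_page_table(struct toy_mm *mm,
        unsigned long address, struct toy_lock **ptl)
{
    unsigned long hi = (address >> (TOY_PAGE_SHIFT + TOY_BITS)) & (TOY_ENTRIES - 1);
    unsigned long lo = (address >> TOY_PAGE_SHIFT) & (TOY_ENTRIES - 1);
    struct toy_pte_dir **slot = &mm->pgd->dir[hi];

    if (*slot == NULL) {
        *slot = calloc(1, sizeof(**slot));   /* allocate the missing level */
        if (*slot == NULL)
            return NULL;                     /* out of memory */
    }
    if (ptl != NULL) {
        (*slot)->lock.held = 1;              /* toy "spin_lock" */
        *ptl = &(*slot)->lock;               /* caller unlocks through this */
    }
    return &(*slot)->pte[lo];                /* slot ready for insertion */
}
```

A second call for the same address finds the directory already present and returns the same slot without allocating again.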

Tear down a page table

[Paul] - add info from Gorman explaining when this function is called.

The function free_pgtables tears down a page table between a range of addresses (floor to ceiling). The process address space is broken up into a list of linear regions (vmas) and free_pgtables traverses through this list of vmas, calling tear_down_pgtable_range on the relevant vma regions.

static inline void tear_down_pgtable_range(struct mmu_gather **tlb,
                        unsigned long addr, unsigned long end,
                        unsigned long floor, unsigned long ceiling)

addr and end represent the vma range to be torn down. The function coalesce_vmas is called in free_pgtables prior to calling tear_down_pgtable_range. This function creates the illusion of joining a number of vmas into one vma prior to calling tear_down_pgtable_range (done for optimisation purposes for the MLPT implementation).

static inline void coallesce_vmas(struct vm_area_struct **vma_p,
                struct vm_area_struct **next_p)

The page table is unused (but possibly previously built) in the address range to be torn down. For the MLPT, tearing down the page table means deallocating relevant pte/pud/pmd directories in the address range.
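The role of floor and ceiling can be illustrated with a simplified userspace model. This is a sketch under assumptions, not the patch's tear-down code: it uses a two-level toy table, and it assumes a directory may only be freed when the whole span it covers lies inside [floor, ceiling), since otherwise a neighbouring vma may still need it. All toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_PAGE_SHIFT 12u                        /* 4K pages (assumption) */
#define TOY_BITS       4u                         /* 16 entries per directory */
#define TOY_ENTRIES    (1u << TOY_BITS)
#define TOY_DIR_SPAN   (1ul << (TOY_PAGE_SHIFT + TOY_BITS)) /* bytes per pte dir */

typedef uint64_t toy_pte_t;
struct toy_pte_dir { toy_pte_t pte[TOY_ENTRIES]; };
struct toy_pgd     { struct toy_pte_dir *dir[TOY_ENTRIES]; };
struct toy_mm      { struct toy_pgd *pgd; };      /* stand-in for mm_struct */

/* Frees pte directories touched by [addr, end); returns how many were freed. */
static inline int toy_tear_down_range(struct toy_mm *mm,
        unsigned long addr, unsigned long end,
        unsigned long floor, unsigned long ceiling)
{
    unsigned long start = addr & ~(TOY_DIR_SPAN - 1);  /* round down to span */
    int freed = 0;

    for (; start < end; start += TOY_DIR_SPAN) {
        unsigned long stop = start + TOY_DIR_SPAN;
        unsigned long hi = (start / TOY_DIR_SPAN) & (TOY_ENTRIES - 1);

        if (start < floor || stop > ceiling)
            continue;            /* span leaks outside: a neighbour may use it */
        if (mm->pgd->dir[hi] == NULL)
            continue;            /* never built, nothing to free */
        free(mm->pgd->dir[hi]);
        mm->pgd->dir[hi] = NULL;
        freed++;
    }
    return freed;
}
```

In the first test case a vma ending at 0x28000 is torn down while the next vma starts at 0x2C000, so the directory covering 0x20000 to 0x30000 is kept.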

Dual Iterators

There are two customised dual iterators, contained in include/mm/mlpt-dual-iterators.h. A dual iterator builds a destination page table whilst iterating over a source page table. Two iterator implementations occur naturally for the 'same conceptual task' because of the different locking requirements.

static inline int copy_page_range_iterator(struct mm_struct *dst_mm, 
    struct mm_struct *src_mm, unsigned long addr, unsigned long end, 
    struct vm_area_struct *vma, pte_rw_iterator_callback_t func)

The copy page range iterator is called during fork and mmap. The source address space is contiguously duplicated to the destination address space for keys in the given range. The callback operates on each of the source and destination entries as it iterates. TESTED
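A dual iterator can be sketched in userspace as a read walk over the source combined with an on-demand build of the destination. The code below is an illustration under assumptions (two-level toy table, calloc for allocation, made-up PTE flag bits, hypothetical toy_ names) and is not the iterator from the patch series. The example callback write-protects both entries, loosely in the spirit of copy-on-write at fork.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_PAGE_SHIFT  12u                       /* 4K pages (assumption) */
#define TOY_PAGE_SIZE   (1ul << TOY_PAGE_SHIFT)
#define TOY_BITS        4u                        /* 16 entries per directory */
#define TOY_ENTRIES     (1u << TOY_BITS)
#define TOY_DIR_SPAN    (TOY_PAGE_SIZE << TOY_BITS)
#define TOY_ADDR_LIMIT  (TOY_DIR_SPAN << TOY_BITS) /* toy address space size */
#define TOY_PTE_PRESENT 0x1ull                    /* made-up flag bits */
#define TOY_PTE_WRITE   0x2ull

typedef uint64_t toy_pte_t;                       /* 0 means "not present" */
struct toy_pte_dir { toy_pte_t pte[TOY_ENTRIES]; };
struct toy_pgd     { struct toy_pte_dir *dir[TOY_ENTRIES]; };
struct toy_mm      { struct toy_pgd *pgd; };      /* stand-in for mm_struct */

typedef void (*toy_copy_callback_t)(toy_pte_t *dst, toy_pte_t *src,
                                    unsigned long addr);

static inline int toy_copy_range_iterator(struct toy_mm *dst_mm,
        struct toy_mm *src_mm, unsigned long addr, unsigned long end,
        toy_copy_callback_t func)
{
    assert(addr <= end && end <= TOY_ADDR_LIMIT);
    while (addr < end) {
        unsigned long hi = addr / TOY_DIR_SPAN;
        unsigned long lo = (addr >> TOY_PAGE_SHIFT) & (TOY_ENTRIES - 1);
        struct toy_pte_dir *src = src_mm->pgd->dir[hi];

        if (src == NULL) {                        /* hole in the source: skip */
            addr = (addr | (TOY_DIR_SPAN - 1)) + 1;
            continue;
        }
        if (src->pte[lo] != 0) {                  /* only existing entries */
            struct toy_pte_dir **slot = &dst_mm->pgd->dir[hi];

            if (*slot == NULL)
                *slot = calloc(1, sizeof(**slot)); /* build the destination */
            if (*slot == NULL)
                return -1;                        /* out of memory */
            func(&(*slot)->pte[lo], &src->pte[lo], addr);
        }
        addr += TOY_PAGE_SIZE;
    }
    return 0;
}

/* Example callback: write-protect the source, then share the entry. */
static inline void toy_copy_write_protect(toy_pte_t *dst, toy_pte_t *src,
                                          unsigned long addr)
{
    (void)addr;
    *src &= ~TOY_PTE_WRITE;
    *dst = *src;
}
```

The destination is only built where the source has entries, so holes in the source stay holes in the copy.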

static inline unsigned long move_page_tables(struct vm_area_struct *vma,
    unsigned long old_addr, struct vm_area_struct *new_vma,
    unsigned long new_addr, unsigned long len, mremap_callback_t func)

This iterator reads an address space and builds a section of the SAME address space (hence the different locking requirements). It is used by the mremap system call for expanding/shrinking memory mappings. TESTED

Read Iterators

There are seven customised read iterators, contained in include/mm/mlpt-read-iterators.h. A read iterator visits each entry that exists in the given address range and applies the callback function passed to it.
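The general shape of a read iterator can be sketched in userspace as follows. This is an assumption-laden illustration rather than one of the seven iterators: it uses a two-level toy table with hypothetical toy_ names. The point it shows is that a read iterator never allocates, and that a missing directory lets it skip the whole span that directory would have covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12u                        /* 4K pages (assumption) */
#define TOY_PAGE_SIZE  (1ul << TOY_PAGE_SHIFT)
#define TOY_BITS       4u                         /* 16 entries per directory */
#define TOY_ENTRIES    (1u << TOY_BITS)
#define TOY_DIR_SPAN   (TOY_PAGE_SIZE << TOY_BITS)  /* bytes per pte directory */
#define TOY_ADDR_LIMIT (TOY_DIR_SPAN << TOY_BITS)   /* toy address space size */

typedef uint64_t toy_pte_t;                       /* 0 means "not present" */
struct toy_pte_dir { toy_pte_t pte[TOY_ENTRIES]; };
struct toy_pgd     { struct toy_pte_dir *dir[TOY_ENTRIES]; };
struct toy_mm      { struct toy_pgd *pgd; };      /* stand-in for mm_struct */

typedef void (*toy_read_callback_t)(toy_pte_t *pte, unsigned long addr,
                                    void *data);

static inline void toy_read_iterator(struct toy_mm *mm, unsigned long addr,
        unsigned long end, toy_read_callback_t func, void *data)
{
    assert(addr <= end && end <= TOY_ADDR_LIMIT);
    while (addr < end) {
        struct toy_pte_dir *dir = mm->pgd->dir[addr / TOY_DIR_SPAN];
        toy_pte_t *pte;

        if (dir == NULL) {
            /* Nothing was ever built here: jump to the next directory. */
            addr = (addr | (TOY_DIR_SPAN - 1)) + 1;
            continue;
        }
        pte = &dir->pte[(addr >> TOY_PAGE_SHIFT) & (TOY_ENTRIES - 1)];
        if (*pte != 0)
            func(pte, addr, data);                /* only existing entries */
        addr += TOY_PAGE_SIZE;
    }
}

/* Example callback: count the entries that were visited. */
static inline void toy_count_present(toy_pte_t *pte, unsigned long addr,
                                     void *data)
{
    (void)pte;
    (void)addr;
    (*(int *)data)++;
}
```

An unmap-style iterator would follow the same walk with a callback that clears each visited entry.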

static inline unsigned long unmap_page_range_iterator(struct mmu_gather *tlb,
        struct vm_area_struct *vma, unsigned long addr, unsigned long end,
        long *zap_work, struct zap_details *details, zap_pte_callback_t func)

The unmap iterator unmaps all page table entries in the given range and flushes the TLB. The page table remains built within this range (exactly as before) except that all ptes in the range are now NULL (not present). The page table itself can now be torn down.

static inline unsigned long msync_read_iterator(struct vm_area_struct *vma,
        unsigned long addr, unsigned long end, msync_callback_t func)

msync flushes changes made to the in-core copy of a file that was mapped into memory using mmap(2) back to disk within this range.

Build Iterators

Benchmarks

Architecture Independent Interface

The following benchmarks examine the degradation in performance of the Linux kernel on the major architectures as a result of accessing the Linux page tables through the architecture independent interface. It is critical that the page table interface minimizes degradation in performance of the Linux VM system for those architectures that will never change page tables away from the MLPT.

IA64 - running a 3 level MLPT with a standard 16K page. RAM 4G - 5 runs. Benchmark: LMbench 2.03

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
                                 null     null                       open    signal   signal    fork    execve  /bin/sh
kernel                           call      I/O     stat    fstat    close   install   handle  process  process  process
-----------------------------  -------  -------  -------  -------  -------  -------  -------  -------  -------  -------
2.6.17-rc3-vanilla               0.272  0.45354    2.439    0.545    4.784    0.550    2.810    113.5    645.1   3517.9
  s.d. (5 runs)                  0.000  0.00124    0.006    0.000    0.012    0.000    0.025      0.0      6.7     15.0
2.6.17-rc3-PTI                   0.272  0.45044    2.383    0.580    4.813    0.555    2.865    118.0    667.2   3592.8
  s.d. (5 runs)                  0.000  6.51920    0.012    0.006    0.036    0.000    0.009      0.0      6.4     12.6


File create/delete and VM system latencies in microseconds - smaller is better
----------------------------------------------------------------------------
                          0K       0K       1K       1K       4K       4K      10K      10K     Mmap     Prot    Page
kernel                  Create   Delete   Create   Delete   Create   Delete   Create   Delete   Latency  Fault   Fault
----------------------- -------  -------  -------  -------  -------  -------  -------  -------  -------  ------  ------
2.6.17-rc3-vanilla        47.69    20.80    72.55    36.24    75.58    36.41    99.08    39.34   4183.2   1.327    1.00
  s.d.                     0.05     0.03     0.36     0.16     3.61     0.06     3.60     0.07     27.2   0.058    0.00
2.6.17-rc3-PTI            47.61    20.57    72.43    36.17    75.51    36.11    99.18    39.13   4413.0   1.308    1.00
  s.d.                     0.02     0.04     0.29     0.22     3.66     0.22     3.13     0.11     33.2   0.035    0.00

Summary: Fork 4.0% deterioration, execve 3.4% deterioration, mmap 5.5% deterioration.
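The percentages in the summary are relative slowdowns of the PTI kernel against the vanilla kernel. As a small sketch of that arithmetic (the helper name is hypothetical, and the inputs are assumed to be the mean latencies in microseconds from the tables above):

```c
#include <assert.h>

/* Relative slowdown in percent: positive means the PTI kernel is slower. */
static inline double toy_percent_slowdown(double vanilla_us, double pti_us)
{
    assert(vanilla_us > 0.0);
    return (pti_us - vanilla_us) / vanilla_us * 100.0;
}
```

For example, toy_percent_slowdown(113.5, 118.0) gives the fork figure, and the execve and mmap figures follow from the 645.1/667.2 and 4183.2/4413.0 pairs in the same way.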

i386 - running a 2 level MLPT with a 4K page.

386 results here.

Where I am losing my performance at the moment

The page fault handler is a very hot code path, sensitive to minor code changes and depends heavily on the organization of data structures. Cache line bouncing has a critical influence on page fault performance in SMP systems and becomes particularly significant for large applications (like huge databases or computational applications) that try to minimize startup time by having multiple threads of a process running on different processors in order to initialize their memory structures concurrently. http://www.kernel.org/pub/linux/kernel/people/christoph/

We need to rework the page fault handler abstraction. Unfortunately we are going to have to look at passing around a struct to get back performance.

General scribble

Zoltan Menyhart > What do you mean with "physical mode"?

"Not using any TLB entry (or any HW supported address translation stuff) to translate the data addresses before they go out of the CPU."

"Walking the page tables in physical mode is insensitive to any TLB purges, therefore these purges do not make sure that there is no other CPU just in the middle of page table walking."

Zoltan's problem: "There is a possibility that walking has already been started, but it has not been completed yet, when "free_pgtables()" runs."

Suggestions by Adam

This section is just to log ideas put forward by Adam

Testing issues

The PTI should not change the operation of the kernel (the abstracted kernel should be functionally equivalent to the original). Code is merely abstracted.

Testing the PTI begins with making the abstracted code execute. A good deal of the abstracted code is called regularly, so a bad abstraction there crashes the kernel immediately. However, some code needs specific knowledge merely to get it to run. The following is a list of problems encountered during testing and their solutions.

## elilo configuration file generated by elilo 3
## limit the memory with append="mem=512M"

delay=20
default=vmlinux
#append="console=ttyS0,115200"
append="mem=512M"

image=/tftpboot/gelato/tartufi/vmlinux
        label="vmlinux"
        root=/dev/sda4
        read-only

This avoids having to type it at the boot prompt.

Rolling forward to 2.6.17-rc6

Reworking the Iterators in the PTI and other stuff

Tutti howto

ssh to crashme, then run console tutti to access tutti. tutti does not use ldap, so ssh to tutti needs a different password. After console tutti you will come up with a boot prompt. To get into the management console, type CTRL-T, then type rst to reboot.

PTI - ia64 Patch

The non hack version.

NB: (notes to myself to fix)

Syncing with GPT

Since Adam wants the quicklist allocator framework kept -> we will have to abstract pgalloc slightly differently.

* Lose pt-pgtable.h

* I prefer the way I got Adam to do it originally.

Release for LCA 2.6.19-rc3

There are three patch series to be applied, PTI, LVHPT and GPT. They will be applied in that order.

PTI patch series

Patch 1 Shifting mlpt allocation functions from memory.c to pt-default.c

Patch 2 Clean up for include/asm-generic/pgtable.h

IA64wiki: PageTableInterface (last edited 2007-01-05 00:57:32 by PaulDavies)
