Aim
This page is currently out of date - fix up after LCA
The page tables in Linux are implemented as an open coded multi level page table (MLPT). The purpose of the page table interface (PTI) project is to define an implementation independent interface to operate on the page tables. This will enable experimentation with different page table structures more suited to the 64 bit address space. The MLPT will remain the DEFAULT page table in Linux and will be referred to as the default page table.
The page table interface has been developed by abstracting the current page table implementation (the default MLPT) behind an independent interface. The interface is explained below from the perspective of the default MLPT.
Patches moving the MLPT behind the new interface can be found at the Gelato CVS repository for the 2.6.17.3 kernel at:
http://www.gelato.unsw.edu.au/IA64wiki/cvs/cvs/kernel/page_table_interface (Link up soon)
For further information please contact PaulDavies at Gelato@UNSW.
INSTRUCTIONS
1) Unpack the 2.6.17.3 kernel.
2) Apply the pti patch series to push the architecture independent interface.
3) Apply the patch pti-ia64-2.6.17-rc4.patch to push the IA64 specific part of the page table interface.
4) Compile for an IA64 and run the kernel on an IA64.
The patched kernel will have the same functionality as the unpatched kernel with almost no deterioration in performance (refer to benchmarks). The kernel is then ready for different page table implementations to be plugged into its new page table interface.
A patch to choose the default MLPT, and move the MLPT abstraction into its own implementation specific location (to logically separate it from other page table implementations) is explained at MoveMLPT and found in the Gelato CVS repository.
Detailed information regarding the virtues of changing page tables can be found at:
http://www.gelato.unsw.edu.au/IA64wiki/VariableRadixPageTables
An explanation of which architectures may be well suited to changing page tables can be found here:
The Page Table Interface
NB: The page table interface posted to linux-mm on 29/05/06 does not yet align with the
The page table interface is composed of two parts. The first part is the architecture independent interface. All architecture independent page table operations are performed through this interface. The cost of running through this interface is negligible for the major architectures (i386/IA64/PowerPC). The second part of the interface is for architecture specific page table operations.
The architecture independent interface is contained in the include/linux/ directory.
include/linux/default-pt.h contains the general operations that can be performed on the default page table (MLPT). This includes page table initialisation, destruction, lookup, building and tear down.
The other basic operation that is performed on the page tables is iteration. The open coding (direct access to, and assumptions about, the implementation) of the MLPT has allowed for many tailored iterations within the VM system. These iterators have been individually abstracted for the MLPT and are broken into read, build and dual iterators.
A number of additional files have been added to the architecture independent Linux code to contain abstracted code for the MLPT.
The architecture dependent interface for IA64 is contained in /arch/ia64/mm/mlpt.h
typedef struct pt_struct { pmd_t *pmd; } pt_path_t;
These guys are worker functions for the PTI.
create_user_page_table creates and initialises a page table. It is called in fork.c to initialise the page table for a newly forked user process. For the MLPT implementation, an initialised page table comprises a zeroed out pgd directory ONLY. It returns 1 on success and 0 on failure (out of memory).
create_kernel_page_table 'creates' and initialises the kernel page table. This compiles out for the MLPT (a zeroed out pgd directory is provided hard coded in assembler for each architecture - known as the swapper_pg_dir). For other page table implementations that use alternative memory management for building page tables, this function is non trivial and won't use swapper_pg_dir.
This guy is in arch dep interface - move
destroy_user_page_table destroys the page table. It is called in fork.c to free the page table for a user process that is being destroyed. For the MLPT implementation, freeing the page table involves returning the allocated pgd directory to the kernel's quicklists. The pud, pmd, and pte directories have already been returned to the kernel via free_pgtables before this function is called.
lookup_page_table looks up a page table (user or kernel) for a mapping. To look up the kernel page table the address of the init process must be passed. The address of the spinlock that covers the pte can be obtained if the lock is to be taken. Call with NULL if the lock is not to be taken. It is the responsibility of the caller to take out the spinlock.
lookup_gate_area looks up the gate area of a page table. The location of the gate area varies with architecture.
build_page_table builds a page table, readying it for insertion. For the MLPT implementation, if pud/pmd/pte directories are required to add a particular user page table entry, then they are allocated, and pointers set accordingly. A spinlock covering the pte will be taken out if a variable to hold the spinlock address is provided (call with NULL and no spinlock will be taken out).
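The create, lookup and build operations described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: it uses a two level table rather than the kernel's pgd/pud/pmd/pte levels, a plain int stands in for spinlock_t, and every toy_ name, size and shift is hypothetical rather than taken from the PTI patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_SHIFT   12   /* 4 KiB pages (assumed) */
#define TOY_ENTRIES 16   /* entries per directory level (assumed) */

typedef unsigned long toy_pte_t;   /* 0 means "not present" */
typedef int toy_lock_t;            /* stand-in for spinlock_t: 1 = held */

typedef struct {
    toy_pte_t *pgd[TOY_ENTRIES];   /* top level: pointers to pte directories */
    toy_lock_t lock;               /* one lock covering every pte */
} toy_mm_t;

static size_t toy_top(unsigned long addr)
{
    return ((addr >> TOY_SHIFT) / TOY_ENTRIES) % TOY_ENTRIES;
}

static size_t toy_low(unsigned long addr)
{
    return (addr >> TOY_SHIFT) % TOY_ENTRIES;
}

/* Create: an initialised table is a zeroed top level directory only.
 * Returns 1 on success, mirroring the convention described in the text. */
static int toy_create_page_table(toy_mm_t *mm)
{
    assert(mm != NULL);
    for (size_t i = 0; i < TOY_ENTRIES; i++)
        mm->pgd[i] = NULL;
    mm->lock = 0;
    return 1;
}

/* Lookup: never allocates. Hands back the lock address when ptl is not
 * NULL; taking the lock is left to the caller. */
static toy_pte_t *toy_lookup_page_table(toy_mm_t *mm, unsigned long addr,
                                        toy_lock_t **ptl)
{
    toy_pte_t *dir = mm->pgd[toy_top(addr)];
    if (dir == NULL)
        return NULL;               /* nothing built for this address */
    if (ptl != NULL)
        *ptl = &mm->lock;
    return &dir[toy_low(addr)];
}

/* Build: allocates the missing directory so an entry can be inserted.
 * Takes the lock itself when ptl is not NULL. */
static toy_pte_t *toy_build_page_table(toy_mm_t *mm, unsigned long addr,
                                       toy_lock_t **ptl)
{
    size_t top = toy_top(addr);
    if (mm->pgd[top] == NULL) {
        mm->pgd[top] = calloc(TOY_ENTRIES, sizeof(toy_pte_t));
        if (mm->pgd[top] == NULL)
            return NULL;           /* out of memory */
    }
    if (ptl != NULL) {
        *ptl = &mm->lock;
        **ptl = 1;
    }
    return &mm->pgd[top][toy_low(addr)];
}
```

A caller would build an entry, write the pte through the returned pointer, then release the lock it was handed. A later lookup of the same address returns the same pte pointer, while a lookup of an address whose directory was never built returns NULL.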
[Paul] - add info from Gorman explaining when this function is called.
The function free_pgtables tears down a page table between a range of addresses (floor to ceiling). The process address space is broken up into a list of linear regions (vmas) and free_pgtables traverses through this list of vmas, calling tear_down_pgtable_range on the relevant vma regions. addr and end represent the vma range to be torn down.
The function coalesce_vmas is called in free_pgtables prior to calling tear_down_pgtable_range. This function creates the illusion of joining a number of vmas into one vma prior to calling tear_down_pgtable_range (done for optimisation purposes for the MLPT implementation).
The page table is unused (but possibly previously built) in the address range to be torn down. For the MLPT, tearing down the page table means deallocating relevant pte/pud/pmd directories in the address range.
There are two customised dual iterators, contained in include/mm/mlpt-dual-iterators.h. A dual iterator builds a destination page table whilst iterating over a source page table. Two iterator implementations occur naturally for the 'same conceptual task' because of the different locking requirements.
The copy page range iterator (copy_page_range_iterator) is called during fork and mmap. The source address space is contiguously duplicated to the destination address space for keys in the given range. The callback operates on each of the source and destination entries as it iterates.
The move page tables iterator (move_page_tables) reads an address space and builds a section of the SAME address space (hence the different locking requirements). It is used by the mremap system call for expanding/shrinking memory mappings.
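The idea of a dual iterator (walk a source table, build the destination as needed, and hand each source/destination pair to a callback) can be sketched in user space as follows. This is an illustration only: the two level layout, the 0 / -1 return convention and all toy_ names are assumptions, and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_SHIFT   12                     /* 4 KiB pages (assumed) */
#define TOY_ENTRIES 4                      /* entries per level (assumed) */
#define TOY_PAGE    (1UL << TOY_SHIFT)

typedef unsigned long toy_pte_t;           /* 0 means "not present" */
typedef struct { toy_pte_t *pgd[TOY_ENTRIES]; } toy_mm_t;

/* Callback invoked for each present source entry and its destination slot. */
typedef void (*toy_rw_callback_t)(toy_pte_t *dst, toy_pte_t *src,
                                  unsigned long addr);

/* Iterate the source over [addr, end) and build the destination on demand.
 * Returns 0 on success and -1 if a destination directory cannot be
 * allocated. A real iterator would skip a whole missing directory at once;
 * the toy steps page by page for brevity. */
static int toy_copy_range_iterator(toy_mm_t *dst, toy_mm_t *src,
                                   unsigned long addr, unsigned long end,
                                   toy_rw_callback_t func)
{
    assert(func != NULL);
    for (; addr < end; addr += TOY_PAGE) {
        unsigned long page = addr >> TOY_SHIFT;
        size_t top = (page / TOY_ENTRIES) % TOY_ENTRIES;
        size_t low = page % TOY_ENTRIES;

        if (src->pgd[top] == NULL || src->pgd[top][low] == 0)
            continue;                      /* nothing mapped in the source */

        if (dst->pgd[top] == NULL) {       /* build the destination path */
            dst->pgd[top] = calloc(TOY_ENTRIES, sizeof(toy_pte_t));
            if (dst->pgd[top] == NULL)
                return -1;
        }
        func(&dst->pgd[top][low], &src->pgd[top][low], addr);
    }
    return 0;
}

/* Simplest possible callback: duplicate the source entry. */
static void toy_copy_one_pte(toy_pte_t *dst, toy_pte_t *src,
                             unsigned long addr)
{
    (void)addr;
    *dst = *src;
}
```

Keeping the per-entry work in the callback means the iterator itself knows only how to walk and build the table, which is the separation the interface is aiming for.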
There are seven customised read iterators, contained in include/mm/mlpt-read-iterators.h. A read iterator visits each entry that exists in the given address range and applies a function that is passed to it.
The unmap iterator (unmap_page_range_iterator) unmaps all page table entries in the given range and flushes the TLB. The page table remains built within this range (identically as before) except that all ptes in the range are now NULL (not present). The page table itself can now be torn down.
The msync iterator (msync_read_iterator) serves msync, which flushes changes made to the in-core copy of a file that was mapped into memory using mmap(2) back to disk, within this range.
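A read iterator in the sense described above can be modelled in a few lines of user-space C. The sketch below is hypothetical throughout (two level table, toy_ names, a counter passed through a void pointer) and omits TLB flushing and locking. Its callback clears each entry, in the spirit of the unmap iterator, while leaving the directories themselves in place.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SHIFT   12                     /* 4 KiB pages (assumed) */
#define TOY_ENTRIES 4                      /* entries per level (assumed) */
#define TOY_PAGE    (1UL << TOY_SHIFT)

typedef unsigned long toy_pte_t;           /* 0 means "not present" */
typedef struct { toy_pte_t *pgd[TOY_ENTRIES]; } toy_mm_t;

/* Callback type: operates on one present entry at the given address. */
typedef void (*toy_read_callback_t)(toy_pte_t *pte, unsigned long addr,
                                    void *arg);

/* Visit every present pte in [addr, end) and return how many were visited.
 * Directories that were never built are skipped, never allocated. */
static unsigned long toy_read_iterator(toy_mm_t *mm, unsigned long addr,
                                       unsigned long end,
                                       toy_read_callback_t func, void *arg)
{
    unsigned long visited = 0;

    assert(func != NULL);
    for (; addr < end; addr += TOY_PAGE) {
        unsigned long page = addr >> TOY_SHIFT;
        toy_pte_t *dir = mm->pgd[(page / TOY_ENTRIES) % TOY_ENTRIES];

        if (dir == NULL)
            continue;                      /* no directory: nothing to visit */
        if (dir[page % TOY_ENTRIES] == 0)
            continue;                      /* entry not present */
        func(&dir[page % TOY_ENTRIES], addr, arg);
        visited++;
    }
    return visited;
}

/* Unmap-style callback: mark the entry not present and count it. */
static void toy_zap_pte(toy_pte_t *pte, unsigned long addr, void *arg)
{
    (void)addr;
    *pte = 0;
    (*(unsigned long *)arg)++;
}
```

After the iterator has run over a range, the directory pointers are unchanged and only the entries inside the range read as not present, matching the description of the unmap iterator.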
The following benchmarks examine the degradation in performance of the Linux kernel on the major architectures as a result of accessing the Linux page tables through the architecture independent interface. It is critical that the page table interface minimizes degradation in performance of the Linux VM system for those architectures that will never change page tables away from the MLPT. Summary: Fork 4.0% deterioration, execve 3.5% deterioration, mmap 5.5% deterioration.
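The summary percentages can be checked against the fork, execve and mmap latencies listed in the benchmark tables on this page. The helper below is a hypothetical convenience, not part of any benchmark tool; it simply computes the relative slowdown of the patched kernel.

```c
#include <assert.h>

/* Relative slowdown of the patched kernel against vanilla, in percent.
 * Both arguments are latencies in the same unit (microseconds here). */
static double toy_deterioration_percent(double vanilla, double patched)
{
    assert(vanilla > 0.0);
    return (patched - vanilla) / vanilla * 100.0;
}
```

Feeding in the table values (fork 113.5 vs 118.0, execve 645.1 vs 667.2, mmap latency 4183.2 vs 4413.0) gives roughly 3.96, 3.43 and 5.49 percent, close to the summary figures quoted above.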
Zoltan Menyhart
> What do you mean with "physical mode"?
"Not using any TLB entry (or any HW supported address translation stuff) to translate the data addresses before they go out of the CPU."
"Walking the page tables in physical mode is insensitive to any TLB purges, therefore these purges do not make sure that there is no other CPU just in the middle of page table walking."
Zoltan's problem: "There is a possibility that walking has already been started, but it has not been completed yet, when "free_pgtables()" runs."
This section is just to log ideas put forward by Adam
The PTI should not change the operation of the kernel (the abstracted kernel should be functionally equivalent to the original). Code is merely abstracted. Testing the PTI begins with making the abstracted code execute. A good deal of the abstracted code is called regularly, so a bad abstraction causes the kernel to crash immediately. However, some code requires specific knowledge merely to get it to run. The following is a list of problems encountered during testing and their solutions.
This avoids having to type it at the boot prompt
Leave this until we start reworking the locking, it is low priority. Also what does it have to do with reworking the PTI iterators? - AdamWiggins
Tutti howto
ssh to crashme. console tutti to access tutti.
tutti does not use ldap. ssh to tutti. Different password
After console tutti, you will come up with a boot prompt.
To get into the management thingy type CTRL T
Then type * rst to reboot
The non hack version. NB: (notes to myself to fix)
Since Adam wants the quicklist allocator framework kept -> we will have to abstract pgalloc slightly differently.
* Lose pt-pgtable.h
* I prefer the way I got Adam to do it originally.
There are three patch series to be applied, PTI, LVHPT and GPT. They will be applied in that order.
Patch 1 Shifting mlpt allocation functions from memory.c to pt-default.c
Patch 2 Clean up for include/asm-generic/pgtable.h
Added four macros to PTI to shift locking inside the page table implementation
lock_pte(mm, pt_path) - lock the pte pointed to by the previously filled path
unlock_pte(mm, pt_path) - unlock the pte pointed to by the previously filled path
get_pte_lock(mm, pt_path, address) - get the pte from a path (which may be partial) and lock it
atomic_pte_same(mm, pte, orig_pte, pt_path) - check that the pte pointed to by pte has not changed from the original pte (the path is needed to provide atomicity)
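One way to picture how these four macros could fit together is the user-space model below. It is a guess at the shape only: the lock is a simple flag rather than a spinlock, the path holds a pointer to a single pte directory, a single per-mm lock is assumed, and every toy_ name is hypothetical rather than taken from the patches.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SHIFT   12                     /* 4 KiB pages (assumed) */
#define TOY_ENTRIES 4                      /* entries per directory (assumed) */

typedef unsigned long toy_pte_t;
typedef struct { int held; } toy_lock_t;   /* stand-in for spinlock_t */
typedef struct { toy_lock_t page_table_lock; } toy_mm_t;
typedef struct { toy_pte_t *dir; } toy_pt_path_t;  /* previously filled path */

static void toy_lock(toy_lock_t *l)   { assert(!l->held); l->held = 1; }
static void toy_unlock(toy_lock_t *l) { assert(l->held);  l->held = 0; }

/* The path is unused here because the toy has one lock per mm; an
 * implementation with finer grained locks would derive the lock from it. */
#define toy_lock_pte(mm, path) \
    ((void)(path), toy_lock(&(mm)->page_table_lock))
#define toy_unlock_pte(mm, path) \
    ((void)(path), toy_unlock(&(mm)->page_table_lock))

/* Resolve the pte for address from the path and return it locked. */
static toy_pte_t *toy_get_pte_lock(toy_mm_t *mm, toy_pt_path_t path,
                                   unsigned long address)
{
    toy_lock_pte(mm, path);
    return &path.dir[(address >> TOY_SHIFT) % TOY_ENTRIES];
}

/* Compare the current pte with a previously read value under the lock,
 * so that the comparison cannot race with a concurrent update. */
static int toy_atomic_pte_same(toy_mm_t *mm, toy_pte_t *pte,
                               toy_pte_t orig_pte, toy_pt_path_t path)
{
    int same;

    toy_lock_pte(mm, path);
    same = (*pte == orig_pte);
    toy_unlock_pte(mm, path);
    return same;
}
```

The point of moving the locking behind such macros is that callers no longer need to know which lock covers a pte; a different page table implementation is free to choose its own granularity.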
SHIFT THESE.
Page table initialisation
static inline int create_user_page_table(struct mm_struct *mm)
static inline void create_kernel_page_table(void)
Page table destruction
static inline void destroy_user_page_table(struct mm_struct *mm)
Look up a page table
static inline pte_t *lookup_page_table(struct mm_struct *mm, unsigned long address,
                spinlock_t **ptl)
static inline pte_t *lookup_gate_area(struct mm_struct *mm, unsigned long address)
Build a page table
static inline pte_t *build_page_table(struct mm_struct *mm, unsigned long address,
                spinlock_t **ptl)
Tear down a page table
static inline void tear_down_pgtable_range(struct mmu_gather **tlb,
                unsigned long addr, unsigned long end,
                unsigned long floor, unsigned long ceiling)
static inline void coallesce_vmas(struct vm_area_struct **vma_p,
                struct vm_area_struct **next_p)
Dual Iterators
static inline int copy_page_range_iterator(struct mm_struct *dst_mm,
                struct mm_struct *src_mm, unsigned long addr, unsigned long end,
                struct vm_area_struct *vma, pte_rw_iterator_callback_t func)
static inline unsigned long move_page_tables(struct vm_area_struct *vma,
                unsigned long old_addr, struct vm_area_struct *new_vma,
                unsigned long new_addr, unsigned long len, mremap_callback_t func)
Read Iterators
static inline unsigned long unmap_page_range_iterator(struct mmu_gather *tlb,
                struct vm_area_struct *vma, unsigned long addr, unsigned long end,
                long *zap_work, struct zap_details *details, zap_pte_callback_t func)
static inline unsigned long msync_read_iterator(struct vm_area_struct *vma,
                unsigned long addr, unsigned long end, msync_callback_t func)
Build Iterators
Benchmarks
Architecture Independent Interface
Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
                        null     null                       open   signal   signal     fork   execve  /bin/sh
kernel                  call      I/O     stat    fstat    close  install   handle  process  process  process
------------------- -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
2.6.17-rc3-vanilla     0.272  0.45354    2.439    0.545    4.784    0.550    2.810    113.5    645.1   3517.9
s.d. (5 runs)          0.000  0.00124    0.006    0.000    0.012    0.000    0.025      0.0      6.7     15.0
2.6.17-rc3-PTI         0.272  0.45044    2.383    0.580    4.813    0.555    2.865    118.0    667.2   3592.8
s.d. (5 runs)          0.000  6.51920    0.012    0.006    0.036    0.000    0.009      0.0      6.4     12.6
File create/delete and VM system latencies in microseconds - smaller is better
----------------------------------------------------------------------------
                         0K      0K      1K      1K      4K      4K     10K     10K    Mmap    Prot    Page
kernel               Create  Delete  Create  Delete  Create  Delete  Create  Delete Latency   Fault   Fault
------------------- ------- ------- ------- ------- ------- ------- ------- ------- ------- ------- -------
2.6.17-rc3-vanilla    47.69   20.80   72.55   36.24   75.58   36.41   99.08   39.34  4183.2   1.327    1.00
s.d.                   0.05    0.03    0.36    0.16    3.61    0.06    3.60    0.07    27.2   0.058    0.00
2.6.17-rc3            47.61   20.57   72.43   36.17   75.51   36.11   99.18   39.13  4413.0   1.308    1.00
s.d.                   0.02    0.04    0.29    0.22    3.66    0.22    3.13    0.11    33.2   0.035    0.00
386 results here.
Where I am losing my performance at the moment
The page fault handler is a very hot code path, sensitive to minor code changes and depends heavily on the organization of data structures. Cache line bouncing has a critical influence on page fault performance in SMP systems and becomes particularly significant for large applications (like huge databases or computational applications) that try to minimize startup time by having multiple threads of a process running on different processors in order to initialize their memory structures concurrently. http://www.kernel.org/pub/linux/kernel/people/christoph/
We need to rework the page fault handler abstraction. Unfortunately we are going to have to look at passing around a struct to get back performance.
General scribble
Suggestions by Adam
Testing issues
## elilo configuration file generated by elilo 3
## limit the memory with append="mem=512M"
delay=20
default=vmlinux
#append="console=ttyS0,115200"
append="mem=512M"
image=/tftpboot/gelato/tartufi/vmlinux
label="vmlinux"
root=/dev/sda4
read-only
Rolling forward to 2.6.17-rc6
Reworking the Iterators in the PTI and other stuff
PTI - ia64 Patches
Syncing with GPT
Release for LCA 2.6.19-rc3
PTI patch series
