Introduction
Extending the Linux page table requires structural changes in several areas of the current page table implementation. Some of these changes are not obvious, which is why this page was created. One question comes first, however: why would you want to extend the page table?
There are several reasons you may want to extend the Linux page table structure. For example:
- To support large amounts of virtual memory. The three-level page table, using 16k pages, uses only 50 bits of virtual address space. While this is enough for most purposes, some applications like to set up large sparse virtual spaces for security, and for them the more address bits the better.
- SuperPages. In large systems, super pages help reduce TLB misses by giving the TLB more coverage.
- VariableRadixPageTables. A dynamic page table that alters its depth and width depending on memory usage.
- TLB sharing, as described on the LongFormatVhpt page. TLB sharing allows programs that are able to share text or code segments to also share TLB entries.
Currently Linux uses a static three-level page table. This page table can be read directly by the hardware of IA32-class machines; other architectures have to copy the information from it into a machine-readable cache in an architecture-specific format. On IA64 Linux, this format is currently the short-format hardware page table, which is a virtual linear array.
A brief explanation
With the advent of virtual memory, every memory access by a processor requires a translation from the virtual address it has been given to the physical memory location that is actually required. To speed up this virtual-to-physical translation, CPU designers introduced a small, fast memory located close to the CPU called the Translation Look-aside Buffer (TLB). When a memory request is presented to the CPU, the CPU first checks whether the virtual-to-physical translation is resident in the TLB. If it is, the CPU uses this translation to access physical memory. If it is not, an alternate course of action is required; this case is called a TLB miss.
When the CPU handles a memory request, the general pattern of events is as follows:
- The CPU gets a memory request.
- The TLB is checked.
- On a hit, the memory is accessed.
- On a miss, a memory lookup is performed.
A memory lookup is a CPU architecture specific action. Since we are dealing with IA64 I will describe its lookup method.
||Action||Hit||Miss||
||TLB||The page frame of the physical address is returned and the physical address is calculated.||If the hardware walker is active then the format that is set will activate (i.e. long or short).||
||VHPT||The page frame of the physical address is returned and the translation is inserted into the TLB.||The software page table is walked, and the virtual to physical translation is located and inserted into the TLB and VHPT. On IA64 Linux the current page table has three levels.||
Figure 1. Action of the TLB and hardware walker hit and miss operations.
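The hit and miss cascade in Figure 1 can be sketched as a small user-space simulation. This is only an illustration under stated assumptions: the table sizes, the direct-mapped slot selection, and every name in it (`lookup`, `walk_page_table`, `struct xlate`) are hypothetical, and the page table walk is replaced by a stub. It is not how the hardware or the kernel implements the lookup.

```c
#include <stdint.h>

#define PAGE_SHIFT 13   /* 8kB pages, as used elsewhere on this page */
#define TLB_SLOTS  4    /* toy size, not a real TLB geometry */
#define VHPT_SLOTS 16   /* toy size, not a real VHPT geometry */

/* One cached virtual-to-physical translation (hypothetical layout). */
struct xlate {
    uint64_t vpn;   /* virtual page number */
    uint64_t pfn;   /* physical frame number */
    int valid;
};

static struct xlate tlb[TLB_SLOTS];
static struct xlate vhpt[VHPT_SLOTS];

enum source { FROM_TLB, FROM_VHPT, FROM_PAGE_TABLE };

/* Stub for the software walk: pretends every page is mapped at a fixed
 * frame offset. A real walk descends PGD -> PMD -> PTE. */
static uint64_t walk_page_table(uint64_t vpn)
{
    return vpn + 0x100;
}

/* Translate vaddr, reporting which level supplied the translation. */
static enum source lookup(uint64_t vaddr, uint64_t *paddr)
{
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    uint64_t off = vaddr & ((UINT64_C(1) << PAGE_SHIFT) - 1);
    struct xlate *t = &tlb[vpn % TLB_SLOTS];
    struct xlate *h = &vhpt[vpn % VHPT_SLOTS];
    enum source src;

    if (t->valid && t->vpn == vpn) {
        src = FROM_TLB;              /* TLB hit */
    } else if (h->valid && h->vpn == vpn) {
        *t = *h;                     /* VHPT hit: insert into the TLB */
        src = FROM_VHPT;
    } else {
        h->vpn = vpn;                /* VHPT miss: walk, then fill both */
        h->pfn = walk_page_table(vpn);
        h->valid = 1;
        *t = *h;
        src = FROM_PAGE_TABLE;
    }
    *paddr = (t->pfn << PAGE_SHIFT) | off;
    return src;
}
```

In this sketch the first access to a page is served by the page table walk, a repeated access by the TLB, and an access to a page whose TLB slot was overwritten by a conflicting page is served by the VHPT.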
== How the hardware walker operates ==
The long-format Virtual Hashed Page Table (VHPT) is an IA64 architecture-specific hash table indexed by a hash of the virtual address to be stored or retrieved. Each VHPT entry consists of four 64-bit CPU words:
- The first 8 bytes are divided into an array of fields that includes a present bit, access right bits, and physical page number bits.
- The next 8 bytes are significant for super pages and guarded page tables. Fields here include page size bits and protection key bits.
- The final 16 bytes are not used here, though they contain the virtual page number and region identifiers.
Note: details on the long-format VHPT hash can be found in [2] pp 4-19, 4-20.
Currently we use the first 8 bytes for physical address storage, and 6 bits of the next 8 bytes for page sizes.
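As a rough picture of the entry layout described above, the following sketch models one long-format VHPT entry as four 64-bit words. The struct and helper names are hypothetical, and the position of the 6-bit page size field (bits 2 to 7 of the second word) is an assumption made for illustration; check it against [2] before relying on it.

```c
#include <stdint.h>

/* One long-format VHPT entry: four 64-bit words, 32 bytes in total.
 * Field names are illustrative only. */
struct lf_vhpt_entry {
    uint64_t word0;   /* present bit, access rights, physical page number */
    uint64_t word1;   /* page size and protection key fields */
    uint64_t word2;   /* unused here (virtual page number / region info) */
    uint64_t word3;   /* unused here */
};

/* ASSUMPTION: the 6-bit page size field sits at bits 2..7 of word1. */
#define LF_PS_SHIFT 2
#define LF_PS_MASK  UINT64_C(0x3f)

/* Read the page size, expressed as a shift (13 means 8kB pages). */
static inline unsigned lf_get_page_shift(const struct lf_vhpt_entry *e)
{
    return (unsigned)((e->word1 >> LF_PS_SHIFT) & LF_PS_MASK);
}

/* Store a page size without disturbing the other bits of word1. */
static inline void lf_set_page_shift(struct lf_vhpt_entry *e, unsigned shift)
{
    e->word1 = (e->word1 & ~(LF_PS_MASK << LF_PS_SHIFT))
             | (((uint64_t)shift & LF_PS_MASK) << LF_PS_SHIFT);
}
```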
Obtaining an address translation from the VHPT takes a single IA64 instruction, thash, which returns a hash index into the VHPT; the required field is then found by indexing from the returned hash value.
== Other architectures ==
Other architectures use a different approach to page table lookups, the main difference being the hardware walker. i386, for example, implements a 2-level page table, or a 3-level page table when the processor supports Physical Address Extension (PAE). PAE allows Intel 32-bit processors to access more than 4GB of physical memory. When a TLB miss occurs on an i386 processor, the hardware walks the page table directly to perform the PTE lookup; software is involved only if that walk finds no valid translation and a page fault is raised.
On i386 Linux, memory reached through PAE is treated as high memory, and the PTE must be extended so that the additional physical address information can be stored. For information on how Linux handles PAE memory, search the Linux source in the arch/i386/ and include/asm-i386/ directories for CONFIG_X86_PAE, CONFIG_HIGHMEM, CONFIG_HIGHMEM4G and CONFIG_HIGHMEM64G.
Extending the Page Table
The following discussion is based on the modifications required to extend the IA64 page table. I will attempt to generalise as much as possible so that modifications of the page table on other architectures may be clearer.
To help understand the modifications being undertaken, figure 2 presents two page table structures based on IA-64 Linux.
Figure 2a: A 3-level page table based on 8kB pages.
Figure 2b: The PTE extension and the halving of the number of PTE entries due to the extra field in the pte struct. Once again this example is based on 8kB page sizes.
The 128 entries in the PGD are a user space restriction for IA-64 only.
In essence, what the PTE extension alters is that the total mappable user virtual address space has decreased. For each region this was 1TB, and now with the extension it is approximately 512GB, based on 8kB pages.
The PGD field is broken up into two sections: the 3 high bits select one of the 5 user space regions and the lower 7 bits reference the PGD offset for each region (see figure 2c and [1] pp 158-160 for further explanation).
== Considerations for Page Table Extensions ==
The considerations that need to be taken into account when extending the Linux page table follow from the discussion of figure 2.
- Because we are increasing the size of a PTE, fewer PTEs fit in a page, so we are also reducing the number of bits used to index one. An example of this can be seen in the code excerpt of figure 3.
#define PAGE_SHIFT        13  /* 8kB pages */
#define PTE_ENTRY_BITS    4   /* for the standard page table this value would be 3 */
#define PTE_INDEX_BITS    (PAGE_SHIFT - PTE_ENTRY_BITS)
#define PMD_SHIFT         (PAGE_SHIFT + PTE_INDEX_BITS)
#define PMD_ENTRY_BITS    3
#define PMD_INDEX_BITS    (PAGE_SHIFT - PMD_ENTRY_BITS)
#define PTRS_PER_PMD      (__IA64_UL(1) << PMD_INDEX_BITS)
#define PGDIR_SHIFT       (PAGE_SHIFT + PTE_INDEX_BITS + PMD_INDEX_BITS)
#define PGD_ENTRY_BITS    3
#define PGD_INDEX_BITS    (PAGE_SHIFT - PGD_ENTRY_BITS)
#define PTRS_PER_PGD      (__IA64_UL(1) << PGD_INDEX_BITS)
#define USER_PTRS_PER_PGD (5*PTRS_PER_PGD/8)  /* total of 8 regions, user space regions 0-4 */
Figure 3
An example of pointer reduction due to page table extensions can also be found in include/asm-i386/pgtable*.h. Comparing the two files pgtable-2level.h and pgtable-3level.h can be helpful.
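To make the effect of figure 3 on address decoding concrete, here is a minimal sketch that derives the shifts at run time from the PTE entry size and extracts the PTE and PMD indices from an address. The `pt_layout` struct and the helper functions are hypothetical names introduced for this illustration; only the arithmetic is taken from figure 3.

```c
#include <stdint.h>

#define PAGE_SHIFT     13   /* 8kB pages, as in figure 3 */
#define PMD_ENTRY_BITS 3

/* Shifts and widths derived from the size of one PTE (hypothetical type). */
struct pt_layout {
    unsigned pte_index_bits;
    unsigned pmd_index_bits;
    unsigned pmd_shift;
    unsigned pgdir_shift;
};

/* pte_entry_bits is log2 of the PTE size: 3 for 8 bytes, 4 for 16 bytes. */
static struct pt_layout make_layout(unsigned pte_entry_bits)
{
    struct pt_layout l;
    l.pte_index_bits = PAGE_SHIFT - pte_entry_bits;
    l.pmd_index_bits = PAGE_SHIFT - PMD_ENTRY_BITS;
    l.pmd_shift      = PAGE_SHIFT + l.pte_index_bits;
    l.pgdir_shift    = l.pmd_shift + l.pmd_index_bits;
    return l;
}

/* Index of the PTE within its page of PTEs. */
static uint64_t pte_index(const struct pt_layout *l, uint64_t addr)
{
    return (addr >> PAGE_SHIFT) & ((UINT64_C(1) << l->pte_index_bits) - 1);
}

/* Index of the PMD entry within its page of PMD entries. */
static uint64_t pmd_index(const struct pt_layout *l, uint64_t addr)
{
    return (addr >> l->pmd_shift) & ((UINT64_C(1) << l->pmd_index_bits) - 1);
}
```

With these definitions, the address 0x3ff << 13 decodes to PTE index 0x3ff and PMD index 0 in the standard layout (entry bits 3), but to PTE index 0x1ff and PMD index 1 in the extended layout (entry bits 4), because one address bit has moved from the PTE index to the PMD index.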
- Each PMD contains pointers to pages of PTEs. Because the PTEs are now larger, a page of PTEs contains fewer PTEs than before.
- With fewer PTEs reachable from each PMD entry, we have also reduced the amount of virtual memory that can be mapped by the page table. The IA-64 Linux kernel needs to be informed about this reduction.
Each architecture has a different method of doing this; IA-64 uses the RGN_MAP_LIMIT macro. This macro is checked in several places when memory is allocated, such as arch/ia64/mm/fault.c, where the kernel checks that the page request does not exceed the bounds of a region (see ia64_do_page_fault).
#define RGN_MAP_LIMIT ((1UL << (PGDIR_SHIFT + PGD_INDEX_BITS - 3)) - PAGE_SIZE) /* per region addr limit */
Figure 4
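The values that figure 4 produces can be checked with a short calculation. The sketch below re-expresses RGN_MAP_LIMIT as a function of the PTE entry size so that the standard and extended page tables can be compared. The function names are hypothetical, and `region_offset` assumes that the region number occupies the top 3 bits of a 64-bit IA-64 virtual address.

```c
#include <stdint.h>

#define PAGE_SHIFT     13                      /* 8kB pages */
#define PAGE_SIZE      (UINT64_C(1) << PAGE_SHIFT)
#define PMD_INDEX_BITS (PAGE_SHIFT - 3)
#define PGD_INDEX_BITS (PAGE_SHIFT - 3)

/* RGN_MAP_LIMIT from figure 4, with PTE_ENTRY_BITS made a parameter:
 * 3 for the standard 8-byte PTE, 4 for the extended 16-byte PTE. */
static uint64_t rgn_map_limit(unsigned pte_entry_bits)
{
    unsigned pte_index_bits = PAGE_SHIFT - pte_entry_bits;
    unsigned pgdir_shift = PAGE_SHIFT + pte_index_bits + PMD_INDEX_BITS;
    return (UINT64_C(1) << (pgdir_shift + PGD_INDEX_BITS - 3)) - PAGE_SIZE;
}

/* Offset of an address within its region: drop the 3 region bits. */
static uint64_t region_offset(uint64_t addr)
{
    return addr & ((UINT64_C(1) << 61) - 1);
}

/* Hypothetical bounds check in the spirit of the kernel's region check. */
static int exceeds_region(uint64_t addr, uint64_t len, unsigned pte_entry_bits)
{
    return region_offset(addr) + len > rgn_map_limit(pte_entry_bits);
}
```

For 8kB pages this gives a limit of 1TB minus one page for the standard page table and 512GB minus one page for the extended one, which matches the reduction described for figure 2.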
For i386, a good starting point for page table adjustments is include/asm-i386/highmem.h; look for LAST_PKMAP. The page table extensions on i386 actually increase the page map area in kernel space; an excellent source of information on this can be found in [3] pp 46-48.
Page table structures. Currently there are three structures of equivalent size (though they are declared separately to enable the compiler to identify possible programming errors). Declarations of these structures are of the type:
typedef struct { unsigned long pgd; } pgd_t;
typedef struct { unsigned long pmd; } pmd_t;
typedef struct { unsigned long pte; } pte_t;
Figure 5
Here we just add the extra field to the structure 'pte_t'; for other architectures these typedefs can be found in include/asm-<arch>/page.h.
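A minimal sketch of what such an extended structure might look like is given below. The field name `ext` and the type names are hypothetical, and fixed-width types are used so that the sizes hold on any host; the point is only that doubling the PTE size is what moves PTE_ENTRY_BITS from 3 to 4 and halves the number of PTEs per page.

```c
#include <stdint.h>

#define PAGE_SHIFT 13   /* 8kB pages */

/* Standard PTE: one 64-bit word. */
typedef struct { uint64_t pte; } std_pte_t;

/* Extended PTE: the original word plus one extra field.
 * The name 'ext' is hypothetical. */
typedef struct { uint64_t pte; uint64_t ext; } ext_pte_t;

/* The entry sizes must match 1 << PTE_ENTRY_BITS (3 and 4 respectively). */
_Static_assert(sizeof(std_pte_t) == (1u << 3), "standard PTE is 8 bytes");
_Static_assert(sizeof(ext_pte_t) == (1u << 4), "extended PTE is 16 bytes");

/* Number of PTEs of the given type that fit in one page. */
#define PTES_PER_PAGE(type) ((UINT64_C(1) << PAGE_SHIFT) / sizeof(type))
```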
#defines. There are several static definitions that control the setup and size of the kernel's page table mapped segment. This area is used for vmalloc calls in the kernel. Its size is determined by a start (VMALLOC_START) and an end (VMALLOC_END). These two definitions depend on other static definitions within the kernel code, such as the page size of the kernel and the size of the global and middle directories. The adjustments to IA-64 can be seen in figure 6; for i386 an example can be found in include/asm-i386/pgtable.h.
/* original IA-64 definition */
# define VMALLOC_END (0xa000000000000000 + (1UL << (4*PAGE_SHIFT - 9)))
/* new IA-64 definition */
# define VMALLOC_END (0xa000000000000000 + (1UL << (PGDIR_SHIFT + PGD_INDEX_BITS)))
Figure 6
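One way to convince yourself that the new VMALLOC_END definition is a generalisation of the original is to evaluate both. The sketch below does so in user space; the function names are hypothetical and the PTE entry size is passed as a parameter rather than fixed by a macro.

```c
#include <stdint.h>

#define PAGE_SHIFT   13   /* 8kB pages */
#define VMALLOC_BASE UINT64_C(0xa000000000000000)

/* Original definition: the shift is hard-coded as 4*PAGE_SHIFT - 9. */
static uint64_t vmalloc_end_original(void)
{
    return VMALLOC_BASE + (UINT64_C(1) << (4 * PAGE_SHIFT - 9));
}

/* New definition: the shift follows the page table geometry.
 * pte_entry_bits is 3 for the standard PTE and 4 for the extended PTE. */
static uint64_t vmalloc_end_new(unsigned pte_entry_bits)
{
    unsigned pte_index_bits = PAGE_SHIFT - pte_entry_bits;
    unsigned pmd_index_bits = PAGE_SHIFT - 3;
    unsigned pgd_index_bits = PAGE_SHIFT - 3;
    unsigned pgdir_shift = PAGE_SHIFT + pte_index_bits + pmd_index_bits;
    return VMALLOC_BASE + (UINT64_C(1) << (pgdir_shift + pgd_index_bits));
}
```

With 8-byte entries at every level the new expression reduces to PAGE_SHIFT + 3*(PAGE_SHIFT - 3) = 4*PAGE_SHIFT - 9, so the two definitions agree; with the 16-byte PTE the mapped area is one bit smaller.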
Here is where the fun starts. If you have made the above kernel changes and everything still compiles (though it should not yet run), we now need to walk the page table. This part is very hardware specific and may take considerable time. On IA-64, as discussed above under page table lookup, a page fault may be a three-step operation: TLB, VHPT and page table walk. IA-64 also has to manage the VHPT and TLB in software, which involves extracting the virtual-to-physical address mapping and inserting it into the VHPT and the TLB; the code for this can be found in the VHPT patch. TLB updates are performed by the itc.X instructions, where X is i or d for instruction or data. The translations must be mapped into the VHPT via a series of steps:
- Retrieve the pte to page frame mapping for the virtual address addr by walking the page table (see LOAD_PTE_MISS and FIND_PTE).
- Retrieve the hash value hpte for addr via the thash instruction.
- Index to the desired hpte field and insert or update the pte value.
- Insert or update the TLB entry with pte.
=== i386 ===
On i386, for instance, when a TLB miss occurs and no valid translation is found, a page fault is triggered and the do_page_fault routine is called. do_page_fault checks where the fault occurred, in the task area or in kernel space, and takes the appropriate action, including searching the page table and creating the virtual-to-physical mapping. TLB updates on i386 are performed as a null operation, which I take to mean that TLB inserts and updates are handled by the hardware (see handle_pte_fault in mm/memory.c and follow the call path from ptep_establish and update_mmu_cache).
Patches
Page_Table_Entry_Extension patch; requires the Long Format VHPT patch for the respective kernel version.
References
[1] Mosberger, David and Eranian, Stephane (2002), IA-64 Linux Kernel: Design and Implementation, Prentice Hall, Upper Saddle River, New Jersey.
[2] Intel (2000), Intel IA-64 Architecture Software Developer's Manual, rev 01, Vol 2, IA-64 System Architecture.
[3] Gorman, Mel (2003), Understanding The Linux Virtual Memory Manager, http://www.skynet.ie/~mel/projects/vm/
[4] Linux Source Code (2004), http://www.kernel.org

Figure 2a
Figure 2b
Figure 2c