Migrate pages - v0.2

From: Zoltan Menyhart <Zoltan.Menyhart_AT_bull.net_at_nospam.org>
Date: 2004-04-02 23:46:13
Here is my next try.

Version 0.2, 2nd of April 2004:
- Efforts made to make it less architecture dependent
- Two big loops of PGD-PMD-PTE scans merged at the expense of
  some "if-then-else"-es and some additional function parameters
- Permission check added
- Excessive DEBUG stuff removed
- Some cosmetics

You can find a description in "Documentation/migrate.txt".

I keep on using a single system call in order not to pollute too much...
Until I have an official system call number, I just picked:

	__NR_page_migrate 1276

I was thinking about reading out statistics via "/proc". Well, I do not like
converting numbers to strings and making the user program convert them back...
In addition, we may have up to 256 nodes, hence up to 256 * 256 migration
counters - the result is not very readable.

Notes about the data security:
- The process controlling the migration cannot read nor modify the data of the
  victim process.
- The data of the victim process cannot be broken. Should the migration fail,
  a data page will be found intact, in a single copy, either on the source
  or the destination NUMA node.
- He who can kill you is allowed to migrate your pages.

The patch is against:        patch-2.6.4-bk4

TODO: check whether pulling pages into a node is better than pushing them out

Your remarks will be appreciated.

Zoltán Menyhárt


P.S.: is the list linux-mm@kvack.org alive ?
diff -Nur 2.6.4.ref/Documentation/migrate.txt 2.6.4.mig2-tmp/Documentation/migrate.txt
--- 2.6.4.ref/Documentation/migrate.txt	Thu Jan  1 01:00:00 1970
+++ 2.6.4.mig2-tmp/Documentation/migrate.txt	Fri Apr  2 14:32:00 2004
@@ -0,0 +1,352 @@
+Migrate pages from a ccNUMA node to another.
+============================================
+
+Version 0.2, 2nd of April 2004
+By Zoltan Menyhart, Bull S.A. <Zoltan.Menyhart@bull.net>
+The usual GPL applies.
+
+What is it all about ?
+----------------------
+
+The old golden days of the Symmetrical Multi-Processor systems are over.
+Gone forever.
+We are left with (cache coherent) Non Uniform Memory Architectures.
+I can see the future.
+I can see systems with hundreds, thousands of processors, with less and less
+uniform memory architectures.
+The "closeness" of a processor to its working set of memory will have the most
+important effect on the performance.
+
+You can make use of the forthcoming NUMA APIs to set up your NUMA environment:
+to bind processes to (groups of) processors, to define the memory placement
+policy, etc.
+
+Yes, the initial placement is very important. It tremendously affects the
+performance you obtain.
+
+Yet, what if
+- the application changes its behavior over time ?
+ (which processor uses which part of the memory)
+- you have not got the source of the application ?
+- you cannot add the NUMA services to it ?
+- you are not authorized to touch it ? (e.g. it is a reference benchmark)
+
+Page migration tries to help you out in these situations.
+
+What can this service do ?
+--------------------------
+
+- Migrate pages identified by their physical addresses to another NUMA node
+- Migrate pages of a virtual user address range to another NUMA node
+
+How can it be used ?
+--------------------
+
+1. Hardware assisted migration
+..............................
+
+As you can guess, it is very much platform dependent.
+I can only give you an example:
+
+We've got an Intel IA64 based machine for development / testing.
+It consists of 4 "Tiger boxes" connected together by a pair of Scalability Port
+Switches. A "Tiger box" is built around a Scalable Node Controller (SNC), and
+includes 4 Itanium-2 processors and some Gbytes of memory.
+The NUMA factor is 1 : 2.25.
+The SNC contains 2048 counters which allow us to count how many times each of
+2048 zones of memory is touched from each node in a given observation period.
+An "artificial intelligence" can make predictions from these usage statistics
+and decide what pages are to be migrated and where.
+
+(Unfortunately, the SNCs are buggy - even version C.1 is - so we've got to use
+a couple of work-arounds; much of the work has to be done in software.
+This wastes about 10 seconds of CPU time while executing a 2-minute benchmark.
+I hope, one day...)
+
+2. Application driven migration
+...............................
+
+An application can exploit the forthcoming NUMA APIs to specify its initial
+memory placement policy.
+Yet what if the application wants to change its behavior ?
+
+Allocating room on the destination node, having the application itself copy
+the data, and finally freeing the original room is not very efficient.
+
+An application can ask the migration service to move a range of its virtual
+address space to the destination node.
+
+Example:
+A process of an application prepares a huge amount of data and hands it over to
+its fellow processes (which happen to be bound to another NUMA node) for their
+(almost) exclusive usage.
+Migrating a page costs 128 remote accesses (assuming a page size of 16 Kbytes
+and a bus transaction size of 128 bytes) + some administration.
+Assuming the consumers of the data touch the page (cache misses) a considerable
+number of times, say more than 1000 times, the migration becomes largely
+profitable.
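+
+In other words, with assumed notation ("t_remote" / "t_local" being the cost
+of a remote / local cache miss, "t_admin" the administration overhead),
+migration roughly pays off when:
+
+	n_touches * (t_remote - t_local)  >  128 * t_remote + t_admin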
+
+3. NUMA aware scheduler
+.......................
+
+A NUMA aware scheduler tries to keep processes on their "home" node where they
+have allocated (most of) their memory. What if the processors in this node are
+overloaded while several processors in the other nodes are largely idle ?
+
+Should the scheduler select some other processors in the other nodes to execute
+these processes, at the expense of a considerable number of extra node
+transactions ?
+Or should the scheduler leave the processors in the other nodes doing nothing ?
+Or should it move some processes with their memory working set to another node ?
+Let's leave this dilemma for the NUMA aware scheduler for the moment.
+
+Once the scheduler has made up its mind, the migration service can move the
+working set of memory of the selected processes to their new "home" node.
+
+User mode interface
+-------------------
+
+This prototype of the page migration service is implemented as a system call,
+the different forms of which are wrapped by use of some small,
+static, inline functions.
+
+NAME
+        migrate_ph_pages        - migrate pages to another NUMA node
+        migrate_virt_addr_range - migrate virtual address range to another node
+
+SYNOPSIS
+
+        #include <sys/types.h>
+        #include "page_migrate.h"
+
+        int migrate_ph_pages(
+                const phaddr_t * const table,
+                const size_t length,
+                const int node,
+                struct _un_success_count_ * const p,
+                const pid_t pid);
+
+        int migrate_virt_addr_range(
+                const caddr_t address,
+                const size_t length,
+                const int node,
+                struct _un_success_count_ * const p,
+                const pid_t pid);
+
+DESCRIPTION
+
+        The "migrate_ph_pages()" system call is used to migrate pages - their
+        physical addresses of "phaddr_t" type are given in "table" - to "node".
+        "length" indicates the number of the physical addresses in "table" and
+        should not be greater than "PAGE_SIZE / sizeof(phaddr_t)".
+        Only the pages belonging to the process indicated by "pid" and its
+        child processes cloned via "clone2(CLONEVM)" are treated; the other
+        processes' pages are silently ignored.
+
+        The "migrate_virt_addr_range()" system call is used to migrate pages of
+        a virtual address range of "length" starting at "address" to "node".
+        The virtual address range belongs to the process indicated by "pid" and
+        to its cloned children. If "pid" is zero then the current
+        process's virtual address range is moved.
+
+        Some statistics are returned via "p":
+
+        struct _un_success_count_ {
+                unsigned int    successful;     // Pages successfully migrated
+                unsigned int    failed;         // Minor failures
+        };
+
+RETURN VALUE
+
+        "migrate_ph_pages()" and "migrate_virt_addr_range()" return 0 on
+        success, or -1 if a major error occurred (in which case, "errno" is set
+        appropriately). Minor errors are silently ignored (migration continues
+        with the rest of the pages).
+
+ERRORS
+
+        ENODEV:         illegal destination node
+        ESRCH:          no process of "pid" can be found
+        EPERM:          no permission
+        EINVAL:         invalid system call parameters
+        EFAULT:         illegal virtual user address
+        ENOMEM:         cannot allocate memory
+
+RESTRICTIONS
+
+        We can migrate a page if it belongs to a single "mm_struct" / PGD,
+        i.e. it is private to a process or shared with its child processes
+        cloned via "clone2(CLONEVM)".
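+
+EXAMPLE
+
+        A minimal sketch (assuming node 1 is online; error handling mostly
+        omitted) that moves a buffer of the calling process to node 1:
+
+        #include <stdio.h>
+        #include <sys/types.h>
+        #include "page_migrate.h"
+
+        static char big_buffer[16 * 1024 * 1024];
+
+        int main(void)
+        {
+                struct _un_success_count_ counts;
+
+                /* "pid" == 0 means the current process itself */
+                if (migrate_virt_addr_range(big_buffer, sizeof big_buffer,
+                                                1, &counts, 0) == -1) {
+                        perror("migrate_virt_addr_range");
+                        return 1;
+                }
+                /* Minor failures are counted, not fatal */
+                printf("migrated: %u, failed: %u\n",
+                                counts.successful, counts.failed);
+                return 0;
+        }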
+
+Notes:
+
+- A "major error" prevents us from carrying on the migration, but it is not a
+  real error for the "victim" application, which can continue (it is guaranteed
+  not to be broken). The pages already migrated are left in their new node.
+
+- Migrating a page shared among other than child processes cloned via
+  "clone2(CLONEVM)" would require locking all the page owners' PGDs.
+  I've got serious concerns about locking more than one PGD:
+  + It is not foreseen in the design of the virtual memory management.
+  + Obviously, the PGDs have to be "trylock()"-ed in order to avoid deadlocks.
+    However, "trylock()"-ing lots of PGDs, possibly thousands of them, would
+    lead to starvation problems. A performance enhancement tool consuming so
+    much while possibly not even concluding...
+
+Some figures
+------------
+
+One of our customers has an OpenMP benchmark which was used to measure the
+machine described above. It uses 1 Gbyte of memory and runs on 16 processors,
+on 4 NUMA nodes.
+
+If the benchmark is adapted to our NUMA architecture, then it takes 86 seconds
+to complete.
+
+As results are not accepted if obtained by modifying the benchmark in any
+way, the best we can do is to use a random or round robin memory allocation
+policy. We end up with a locality rate of 25 % and the benchmark executes in 121
+seconds.
+
+If we had a zero-overhead migration tool, then - I estimate - it would complete
+in 92 seconds (the benchmark starts in a "pessimized" environment, and it takes
+time for the locality to ramp up from 25 % to almost 100 %).
+
+Actually it takes 2 to 3 seconds to move 750 Mbytes of memory (on a heavily
+loaded machine), reading out the counters of the SNCs and making some quick
+decisions take 1 to 2 seconds, and we lose about 10 seconds due to the buggy
+SNCs. We end up with 106 seconds.
+
+Some if's
+---------
+
+- if the benchmark used more memory, then it would be more expensive to migrate
+  all of its pages
+- if the benchmark ran for longer without changing its memory usage
+  pattern, then it could spend a greater percentage of its lifetime in a well
+  localized environment
+- if you had a NUMA factor higher than ours, then obviously, you would gain
+  more in performance by use of the migration service
+- if we used Madison processors with 6 Mbytes of cache (twice as much as we
+  have right now), then the NUMA factor would be masked more efficiently
+- if the clock frequency of the processors increases, then you run out of cached
+  data more quickly and the NUMA factor cuts performance even more
+
+Notes about the data security
+-----------------------------
+
+- The process controlling the migration cannot read nor modify the data of the
+  victim process.
+- The data of the victim process cannot be broken. Should the migration fail,
+  a data page will be found intact, in a single copy, either on the source
+  or the destination NUMA node.
+- He who can kill you is allowed to migrate your pages.
+
+Porting guide
+-------------
+
+include/asm-.../page_migrate.h:
+...............................
+
+	Copy "include/asm-ia64/page_migrate.h" into the directory
+	"include/asm-<your architecture>
+
+Define "return_t":
+
+	Type of the return value of the system call. It should be able to hold
+	negative values as well as values of "phaddr_t"; the structures
+	"_un_success_count_" and "_statistics_size_" are cast to this type.
+
+Define "phaddr_t":
+
+	Type of a physical address.
+
+Define "struct _un_success_count_":
+
+	Numbers of the pages successfully migrated and of the minor failures.
+
+Define "_statistics_size_":
+
+	Holds sizes of "struct _statistics_" are.
+
+Define "struct _statistics_":
+
+	The counters should not overflow.
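+
+	For a hypothetical 64-bit architecture, a minimal sketch of these
+	definitions could look like (the "include/asm-ia64/page_migrate.h"
+	of this patch is a complete example):
+
+		typedef long long	return_t;
+		typedef long long	phaddr_t;
+
+		struct _un_success_count_ {
+			unsigned int	successful;	/* Pages migrated */
+			unsigned int	failed;		/* Minor failures */
+		};
+
+		struct _statistics_size_ {
+			unsigned int	sizeof_statistics;
+			unsigned int	max_nodes;	/* MAX_NUMNODES */
+		};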
+
+include/asm-.../delay.h:
+........................
+
+Define "GET_TIMER()":
+
+	This macro / function should return the current clock ticks.
+	It can simply return zero => no timing info will be available.
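+
+	E.g. the minimal fallback allowed above:
+
+		#define GET_TIMER()	(0UL)
+
+	(The IA64 version of this patch returns "ia64_get_itc()".)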
+
+include/linux/pagemap.h:
+........................
+
+Define "__IS_VADDR_ALIAS(address, length)":
+
+	Some architectures do not decode all the MSB-s of virtual
+	addresses for the PGD, PMD and PTE indices, i.e. they have
+	got holes or aliases in the virtual user address space.
+	This macro / function should return TRUE if "length" spans over
+	virtual address holes or creates an illegal alias to an
+	otherwise valid address.
+
+	IA64 example (assuming a page size of 16 Kbytes):
+	There are 5 user regions, starting at addresses "i << 61" (where "i"
+	goes from 0 to 4). Only the first 16 Tbytes of each region is valid.
+	We count as:
+
+	0, 1, 2, ... 0x00000fffffffffff,
+	0x2000000000000000, 0x2000000000000001, ... 0x20000fffffffffff,
+	0x4000000000000000, etc.
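+
+	For an architecture whose user address space is a single contiguous,
+	hole-free region, a minimal sketch (an assumption, not part of this
+	patch) only needs to reject a wrap-around:
+
+		#define __IS_VADDR_ALIAS(address, length)	\
+			((address) + (length) - 1 < (address))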
+
+Define "__VA(pgd_i, pmd_i, pte_i)":
+
+	This macro / function converts PGD, PMD and PTE indices into a
+	virtual address. Beware of the illegal virtual address aliases.
+	Not absolutely necessary if not testing. However, as it is the
+	counterpart of "pgd_index()", "pmd_index()" and "pte_index()"...
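+
+	For the same hypothetical hole-free architecture, a minimal sketch
+	could simply recombine the indices:
+
+		#define __VA(pgd_i, pmd_i, pte_i)				\
+			((pgd_i) << PGDIR_SHIFT | (pmd_i) << PMD_SHIFT |	\
+			 (pte_i) << PAGE_SHIFT)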
+
+include/asm-.../unistd.h:
+.........................
+
+	Add "#define __NR_page_migrate 1276"
+
+arch/.../Kconfig:
+.................
+
+Add:
+
+config PAGE_MIGRATE
+	bool "Support for migrating pages from a NUMA node to another"
+	depends on DISCONTIGMEM && NUMA
+	default n
+	help
+	  Say Y to compile the kernel to support migrating either pages
+	  identified by their physical addresses or a user mode virtual
+	  address range from a NUMA node to another.
+	  This option is for optimizing memory allocation pattern for
+	  high-end NUMA server systems.
+	  If in doubt, say N.
+
+To activate page migration, select:
+	Processor type and features
+	  NUMA support
+	    Support for migrating pages from a NUMA node to another
+
+Revision history:
+-----------------
+
+Version 0.1, 25th of March 2004:
+	- Initial version
+
+Version 0.2, 2nd of April 2004:
+	- Efforts made to make it less architecture dependent
+	- Two big loops of PGD-PMD-PTE scans merged at the expense of
+	  some "if-then-else"-es and some additional function parameters
+	- Permission check added
+	- Excessive DEBUG stuff removed
+	- Some cosmetics :-)
+
diff -Nur 2.6.4.ref/arch/ia64/Kconfig 2.6.4.mig2-tmp/arch/ia64/Kconfig
--- 2.6.4.ref/arch/ia64/Kconfig	Tue Mar 16 13:36:30 2004
+++ 2.6.4.mig2-tmp/arch/ia64/Kconfig	Fri Apr  2 11:28:21 2004
@@ -218,6 +218,18 @@
 	  Access).  This option is for configuring high-end multiprocessor
 	  server systems.  If in doubt, say N.
 
+config PAGE_MIGRATE
+	bool "Support for migrating pages from a NUMA node to another"
+	depends on DISCONTIGMEM && NUMA
+	default n
+	help
+	  Say Y to compile the kernel to support migrating either pages
+	  identified by their physical addresses or a user mode virtual
+	  address range from a NUMA node to another.
+	  This option is for optimizing memory allocation pattern for
+	  high-end NUMA server systems.
+	  If in doubt, say N.
+
 config VIRTUAL_MEM_MAP
 	bool "Virtual mem map"
 	default y if !IA64_HP_SIM
diff -Nur 2.6.4.ref/arch/ia64/kernel/acpi.c 2.6.4.mig2-tmp/arch/ia64/kernel/acpi.c
--- 2.6.4.ref/arch/ia64/kernel/acpi.c	Tue Mar 16 10:18:04 2004
+++ 2.6.4.mig2-tmp/arch/ia64/kernel/acpi.c	Fri Apr  2 11:28:17 2004
@@ -457,6 +457,7 @@
 	for (i = 0; i < MAX_PXM_DOMAINS; i++) {
 		if (pxm_bit_test(i)) {
 			pxm_to_nid_map[i] = numnodes;
+			node_set_online(numnodes);
 			nid_to_pxm_map[numnodes++] = i;
 		}
 	}
diff -Nur 2.6.4.ref/arch/ia64/kernel/entry.S 2.6.4.mig2-tmp/arch/ia64/kernel/entry.S
--- 2.6.4.ref/arch/ia64/kernel/entry.S	Tue Mar 16 10:18:04 2004
+++ 2.6.4.mig2-tmp/arch/ia64/kernel/entry.S	Fri Apr  2 11:28:17 2004
@@ -1518,7 +1518,7 @@
 	data8 sys_ni_syscall
 	data8 sys_ni_syscall
 	data8 sys_ni_syscall			// 1275
-	data8 sys_ni_syscall
+	data8 sys_page_migrate
 	data8 sys_ni_syscall
 	data8 sys_ni_syscall
 	data8 sys_ni_syscall
diff -Nur 2.6.4.ref/include/asm-ia64/delay.h 2.6.4.mig2-tmp/include/asm-ia64/delay.h
--- 2.6.4.ref/include/asm-ia64/delay.h	Tue Mar 16 10:18:15 2004
+++ 2.6.4.mig2-tmp/include/asm-ia64/delay.h	Fri Apr  2 11:30:14 2004
@@ -20,6 +20,11 @@
 #include <asm/intrinsics.h>
 #include <asm/processor.h>
 
+/*
+ * Architecture independent macro name for reading the timer.
+ */
+#define	GET_TIMER()	ia64_get_itc()
+
 static __inline__ void
 ia64_set_itm (unsigned long val)
 {
diff -Nur 2.6.4.ref/include/asm-ia64/page_migrate.h 2.6.4.mig2-tmp/include/asm-ia64/page_migrate.h
--- 2.6.4.ref/include/asm-ia64/page_migrate.h	Thu Jan  1 01:00:00 1970
+++ 2.6.4.mig2-tmp/include/asm-ia64/page_migrate.h	Fri Apr  2 11:30:14 2004
@@ -0,0 +1,218 @@
+/*
+ * Migrate pages from a NUMA node to another.
+ * ==========================================
+ *
+ * Version 0.2, 2nd of April 2004
+ * By Zoltan Menyhart, Bull S.A. <Zoltan.Menyhart@bull.net>
+ * The usual GPL applies.
+ *
+ * (See "Documentation/migrate.txt".)
+ */
+
+
+/*
+ * Type of the return value of the system call. It should be able to hold
+ * negative values as well as values of "phaddr_t"; the structures
+ * "_un_success_count_" and "_statistics_size_" are cast to this type.
+ */
+typedef	long long	return_t;
+
+/*
+ * Type of a physical address -- hopefully enough for all architectures.
+ */
+typedef	long long	phaddr_t;
+
+
+struct _un_success_count_ {
+	unsigned int	successful;		/* Pages successfully migrated */
+	unsigned int	failed;			/* Minor failures */
+};
+
+
+struct _statistics_size_ {
+	unsigned int	sizeof_statistics;	/* sizeof(struct _statistics_) */
+	unsigned int	max_nodes;		/* MAX_NUMNODES */
+};
+
+
+/*
+ * Statistics are accessed in a non atomic way.
+ * Who cares? Just some statistics :-)
+ */
+struct _statistics_ {
+	struct {					/* Error counters */
+		unsigned long	non_existent_addr;
+		unsigned long	page_gone_away;
+		unsigned long	busy;
+		unsigned long	bad_request;
+		unsigned long	no_memory;		/* On the target node */
+		unsigned long	page_type_not_supp;
+		unsigned long	errors;			/* "PageError()" is set */
+	} e;
+	struct {					/* Clock ticks */
+		unsigned long	total;
+		unsigned long	page_alloc;
+		unsigned long	page_free;
+		unsigned long	page_lock;
+		unsigned long	new_page_unlock;
+		unsigned long	page_unlock;
+		unsigned long	pgd_scan;
+		unsigned long	pgd_lock;
+		unsigned long	pgd_unlock;
+		unsigned long	mmap_sem;
+		unsigned long	pte_chain_lock;
+		unsigned long	find_vma;
+		unsigned long	flush_tlb;
+		unsigned long	add_lru;
+		unsigned long	copy;
+		unsigned long	update_mmu_cache;
+		unsigned long	mm_lookup;
+		unsigned long	cyc_per_usec;
+		unsigned long	perfbullctl;
+		unsigned long	pci_cfg_rd;
+		unsigned long	pci_cfg_wr;
+	} t;
+	struct {					/* Event counters */
+		unsigned long	mm_hit;
+		unsigned long	pgd_scan;
+		unsigned long	perfbullctl;
+		unsigned long	pci_cfg_rd;
+		unsigned long	pci_cfg_wr;
+	} c;
+#if defined(__KERNEL__)
+	unsigned long	count[MAX_NUMNODES][MAX_NUMNODES];
+#else
+	unsigned long	count[0][0];
+#endif
+};
+
+
+#if !defined(__KERNEL__)
+
+
+#include <unistd.h>
+#include <sys/types.h>
+
+#if !defined(__NR_page_migrate)
+#define __NR_page_migrate	1276
+#endif
+
+
+/*
+ * Migrate some pages of the process of PID.
+ *
+ * Arguments:	table:	-> physical addresses of the pages
+ *		length:	Indicates the number of the physical addresses
+ *		node:	Destination NUMA node
+ *		p:	-> status returned
+ *		pid:	Only pages belonging to this process and its
+ *			"clone2(CLONEVM)"-ed children are move
+ *
+ * Returns:	-1 if a major error occurred (in which case, "errno" is set
+ *		appropriately). Minor errors are silently ignored (migration
+ *		continues with the rest of the pages).
+ */
+static inline int
+migrate_ph_pages(const phaddr_t				* const table,
+			const size_t			length,
+			const int			node,
+			struct _un_success_count_	* const p,
+			const pid_t			pid)
+{
+	union {
+		return_t			ll;
+		struct _un_success_count_	s;
+	} u;
+
+	u.ll = syscall(__NR_page_migrate, _PHADDR_BATCH_MIGRATE_,
+						table, length, node, pid);
+	if (u.ll == -1)
+		return -1;
+	if (p != NULL){
+		p->successful = u.s.successful;
+		p->failed = u.s.failed;
+	}
+	return 0; 
+}
+
+
+/*
+ * Migrate virtual address range of the process of PID.
+ *
+ * Arguments:	addr:	Starting address of the virtual address range
+ *		length:	Length of the virtual address range
+ *		node:	Destination NUMA node
+ *		p:	-> status returned
+ *		pid:	Only pages belonging to this process and its
+ *			"clone2(CLONEVM)"-ed children are move
+ *
+ * Returns:	-1 if a major error occurred (in which case, "errno" is set
+ *		appropriately). Minor errors are silently ignored (migration
+ *		continues with the rest of the pages).
+ */
+static inline int
+migrate_virt_addr_range(const caddr_t			addr,
+			const size_t			length,
+			const int			node,
+			struct _un_success_count_	* const p,
+			const pid_t			pid)
+{
+	union {
+		return_t			ll;
+		struct _un_success_count_	s;
+	} u;
+
+	u.ll = syscall(__NR_page_migrate, _VA_RANGE_MIGRATE_,
+						addr, length, node, pid);
+	if (u.ll == -1)
+		return -1;
+	if (p != NULL){
+		p->successful = u.s.successful;
+		p->failed = u.s.failed;
+	}
+	return 0; 
+}
+
+
+/*
+ * Obtain the size of the statistics structure.
+ */
+static inline int
+get_stat_sizes(struct _statistics_size_ * const p)
+{
+	union {
+		return_t			ll;
+		struct _statistics_size_	s;
+	} u;
+
+	u.ll = syscall(__NR_page_migrate, _SIZEOF_STATISTICS_, 0, 0, 0, 0);
+	if (u.ll == -1)
+		return -1;
+	if (p != NULL)
+		*p = u.s;
+	return 0; 
+}
+
+
+/*
+ * Fetch and clear the statistics.
+ */
+static inline int
+get_statistics(struct _statistics_ * const p, const long clear_flag)
+{
+	return syscall(__NR_page_migrate, _STATISTICS_, p, clear_flag, 0, 0);
+}
+
+
+/*
+ * Return a physical address.
+ */
+static inline phaddr_t
+gimme_a_ph_address(const caddr_t vaddr)
+{
+	return syscall(__NR_page_migrate, _GIMME_AN_ADDRESS_, vaddr, 0, 0, 0);
+}
+
+
+#endif	/* #if !defined(__KERNEL__) */
+
diff -Nur 2.6.4.ref/include/asm-ia64/pgtable.h 2.6.4.mig2-tmp/include/asm-ia64/pgtable.h
--- 2.6.4.ref/include/asm-ia64/pgtable.h	Tue Mar 16 10:18:15 2004
+++ 2.6.4.mig2-tmp/include/asm-ia64/pgtable.h	Fri Apr  2 11:30:14 2004
@@ -112,6 +112,30 @@
 #define PTRS_PER_PTE	(__IA64_UL(1) << (PAGE_SHIFT-3))
 
 /*
+ * The IA64 architecture does not decode all the MSB-s of virtual addresses for
+ * PGD, PMD and PTE indices, i.e. IA64 has got holes or aliases in the virtual
+ * address space.
+ * These def's are provided to check whether an "address" -- "length" pair
+ * spans over virtual address holes or creates an illegal alias to an otherwise
+ * valid address. (User mode virtual addresses only.)
+ */
+#define	__VADDR_BITS_PER_REGION	(PAGE_SHIFT - 3 - 3 +	/* PGD low index */	\
+				2 * (PAGE_SHIFT - 3) +	/* PMD & PTE indices */	\
+				PAGE_SHIFT)		/* The page itself */
+#define	__VADDR_ALIAS_MASK		((1UL << __VADDR_BITS_PER_REGION) - 1)
+#define	__IS_VADDR_ALIAS(address, length)					\
+			((~__VADDR_ALIAS_MASK & (address)) !=			\
+			(~__VADDR_ALIAS_MASK & ((address) + (length) - 1)))
+
+/*
+ * Virtual address composed by use of PGD, PMD and PTE indices:
+ */
+#define	__VA(pgdi, pmdi, ptei)							\
+			(((pgdi) >> (PAGE_SHIFT - 6)) << 61 |			\
+			((pgdi) & ((PTRS_PER_PGD >> 3) - 1)) << PGDIR_SHIFT |	\
+			(pmdi) << PMD_SHIFT | (ptei) << PAGE_SHIFT)
+
+/*
  * All the normal masks have the "page accessed" bits on, as any time
  * they are used, the page is accessed. They are cleared only by the
  * page-out routines.
@@ -325,8 +349,10 @@
 	(init_mm.pgd + (((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1)))
 
 /* Find an entry in the second-level page table.. */
-#define pmd_offset(dir,addr) \
-	((pmd_t *) pgd_page(*(dir)) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1)))
+#define pmd_index(addr) \
+	(((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
+#define pmd_offset(dir, addr) \
+	((pmd_t *) pgd_page(*(dir)) + pmd_index(addr))
 
 /*
  * Find an entry in the third-level page table.  This looks more complicated than it
diff -Nur 2.6.4.ref/include/asm-ia64/unistd.h 2.6.4.mig2-tmp/include/asm-ia64/unistd.h
--- 2.6.4.ref/include/asm-ia64/unistd.h	Tue Mar 16 10:18:15 2004
+++ 2.6.4.mig2-tmp/include/asm-ia64/unistd.h	Fri Apr  2 11:30:14 2004
@@ -251,6 +251,7 @@
 #define __NR_reserved1			1259	/* reserved for NUMA interface */
 #define __NR_reserved2			1260	/* reserved for NUMA interface */
 #define __NR_reserved3			1261	/* reserved for NUMA interface */
+#define __NR_page_migrate		1276
 
 #ifdef __KERNEL__
 
diff -Nur 2.6.4.ref/include/linux/page_migrate.h 2.6.4.mig2-tmp/include/linux/page_migrate.h
--- 2.6.4.ref/include/linux/page_migrate.h	Thu Jan  1 01:00:00 1970
+++ 2.6.4.mig2-tmp/include/linux/page_migrate.h	Fri Apr  2 11:30:16 2004
@@ -0,0 +1,81 @@
+#define	_TEST_
+#define	_NEED_STATISTICS_
+
+#if	!defined(_PAGE_MIGRATE_)
+#define	_PAGE_MIGRATE_
+
+
+/*
+ * Migrate pages from a NUMA node to another.
+ * ==========================================
+ *
+ * Version 0.2, 2nd of April 2004
+ * By Zoltan Menyhart, Bull S.A. <Zoltan.Menyhart@bull.net>
+ * The usual GPL applies.
+ *
+ * (See "Documentation/migrate.txt".)
+ *
+ * System call syntax:
+ * -------------------
+ *
+ *	return_t sys_page_migrate(int command, caddr_t address, size_t length,
+ *							int node, pid_t pid);
+ *
+ * On error "-1" is returned and "errno" holds the error code.
+ *
+ * The following commands are available:
+ */
+enum {
+/*
+ * - Return a physical address:
+ */
+	_GIMME_AN_ADDRESS_,
+/*
+ *   On entry, if "address" is a valid virtual address in the address space of
+ *   the current task with an existing backing page, then its physical address
+ *   is returned.
+ *   (Testing only, the kernel has to be compiled with "#define	_TEST_".)
+ *
+ * - Fetch and clear the statistics:
+ */
+	_STATISTICS_,
+/*
+ *   "address" is a pointer to the user's buffer. If "length != 0" then having
+ *   been fetched, the statistics get cleared.
+ *   The other arguments are don't care.
+ *
+ * - Obtain the size of the statistics structure in "struct _statistics_size_":
+ */
+	_SIZEOF_STATISTICS_,
+/*
+ *   The arguments are don't care.
+ *
+ * - Batch migrate pages from a NUMA node to another.
+ */
+	_PHADDR_BATCH_MIGRATE_,
+/*
+ *   "address" points at the user table containing the physical address of the
+ *   pages to be migrated.
+ *   "length" is the number of the physical addresses in the buffer.
+ *   Max. "PAGE_SIZE / sizeof(phaddr_t)" of them can be migrated at once.
+ *   "node" is the destination NUMA node.
+ *   Addresses are assumed to belong to the process indicated by "pid".
+ *   The number of the pages actually migrated is returned,
+ *   see "struct _un_success_count_.
+ *
+ * - Migrate virtual address range of a process:
+ */
+	_VA_RANGE_MIGRATE_,
+/*
+ *   "address" is the starting virtual address in a process'es address space.
+ *   "length" is the length of the address range to be migrated
+ *   Addresses are assumed to belong to the process indicated by "pid".
+ *   The number of the pages actually migrated is returned,
+ *   see "struct _un_success_count_.
+ */
+};
+
+#include <asm/page_migrate.h>
+
+#endif	/* #if !defined(_PAGE_MIGRATE_) */
+
diff -Nur 2.6.4.ref/kernel/sys.c 2.6.4.mig2-tmp/kernel/sys.c
--- 2.6.4.ref/kernel/sys.c	Tue Mar 16 10:18:17 2004
+++ 2.6.4.mig2-tmp/kernel/sys.c	Fri Apr  2 11:30:46 2004
@@ -260,6 +260,7 @@
 cond_syscall(sys_shmget)
 cond_syscall(sys_shmdt)
 cond_syscall(sys_shmctl)
+cond_syscall(sys_page_migrate)
 
 /* arch-specific weak syscall entries */
 cond_syscall(sys_pciconfig_read)
diff -Nur 2.6.4.ref/mm/Makefile 2.6.4.mig2-tmp/mm/Makefile
--- 2.6.4.ref/mm/Makefile	Tue Mar 16 10:18:17 2004
+++ 2.6.4.mig2-tmp/mm/Makefile	Fri Apr  2 11:30:51 2004
@@ -11,4 +11,6 @@
 			   page_alloc.o page-writeback.o pdflush.o readahead.o \
 			   slab.o swap.o truncate.o vmscan.o $(mmu-y)
 
+obj-$(CONFIG_PAGE_MIGRATE) += migrate.o
+
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
diff -Nur 2.6.4.ref/mm/migrate.c 2.6.4.mig2-tmp/mm/migrate.c
--- 2.6.4.ref/mm/migrate.c	Thu Jan  1 01:00:00 1970
+++ 2.6.4.mig2-tmp/mm/migrate.c	Fri Apr  2 14:51:42 2004
@@ -0,0 +1,888 @@
+/*
+ * Migrate pages from a ccNUMA node to another.
+ * ============================================
+ *
+ * Version 0.2, 2nd of April 2004
+ * By Zoltan Menyhart, Bull S.A. <Zoltan.Menyhart@bull.net>
+ * The usual GPL applies.
+ *
+ * See also "Documentation/migrate.txt" and "page_migrate.h".
+ */
+
+
+#include <linux/mm.h>
+#include <linux/smp_lock.h>
+#include <linux/pagemap.h>
+#include <linux/rmap-locking.h>
+#include <linux/swap.h>
+#include <linux/vmalloc.h>
+#include <asm/rmap.h>
+#include <asm/tlbflush.h>
+#include <linux/page_migrate.h>
+#include <linux/delay.h>		/* For "GET_TIMER()" */
+
+
+/*
+ * Type of virtual addresses. Pointers converted to this type to be able to
+ * do some atithmetics.
+ */
+typedef	unsigned long	vaddr_t;
+
+
+#if defined(_TEST_)
+
+/*
+ * Set the bits - as defined below - for some kernel messages.
+ */
+unsigned int _pr_flag_;
+
+#define	PRINT_errors	1
+#define	PRINT_etc	2
+#define	PRINT_pgd	4		/* Show PGD scan */
+
+#define PRINT(args...)		do {						\
+					if (_pr_flag_)				\
+						printk(args);			\
+				} while (0)
+#define PRINT_ERR(args...)	do {						\
+					if (_pr_flag_ & PRINT_errors)		\
+						printk(args);			\
+				} while (0)
+#define PRINT_ETC(args...)	do {						\
+					if (_pr_flag_ & PRINT_etc)		\
+						printk(args);			\
+				} while (0)
+#define PRINT_PGD(args...)	do {						\
+					if (_pr_flag_ & PRINT_pgd)		\
+						printk(args);			\
+				} while (0)
+
+static const char dest_not_online[] =	"Destination node not online\n";
+static const char no_vma[] =		"Cannot find VMA for address 0x%lx\n";
+static const char illegal_pid[] =	"Illegal PID\n";
+static const char inv_n_addresses[] =	"Invalid number of addresses\n";
+static const char ill_va_alias[] =	"v-addr alias in range: 0x%p...0x%p\n";
+static const char no_memory[] =		"No more memory\n";
+static const char ill_user_buff[] =	"Illegal user buffer address\n";
+
+phaddr_t	gimme_an_address(const caddr_t);
+
+#define	STATIC
+#define	INLINE
+
+#else	/* #if defined(_TEST_) */
+
+#define PRINT(args...)		do { } while (0)
+#define PRINT_ERR(args...)	do { } while (0)
+#define PRINT_ETC(args...)	do { } while (0)
+#define PRINT_PGD(args...)	do { } while (0)
+
+#define	STATIC			static
+#define	INLINE			inline
+
+#endif	/* #if defined(_TEST_) */
+
+
+STATIC return_t
+common_page_migrate(const int, const caddr_t, const size_t, const int,
+								const pid_t);
+
+STATIC INLINE return_t
+validate_migrate_pages(const int, const vaddr_t, const vaddr_t, const int,
+						struct mm_struct * const);
+
+STATIC INLINE int
+migr_1_page_by_pte(const int, const phaddr_t * const, const size_t, const int,
+				struct mm_struct * const, pte_t * const);
+
+STATIC INLINE return_t
+batch_migrate(const caddr_t, size_t, const int, const pid_t);
+
+int
+check_migr_1_page(struct page * const, struct page * const,
+					struct mm_struct * const, pte_t * const);
+
+
+/*
+ * These are the flags which are copied for the new page:
+ */
+#define	FLAG_MASK	(PG_referenced | PG_uptodate | PG_dirty | PG_active |	\
+			 PG_highmem | PG_arch_1 | PG_private | PG_writeback |	\
+			 PG_nosave | PG_mappedtodisk | PG_reclaim | PG_compound)
+
+
+/*
+ * Migration type for "common_page_migrate()":
+ */
+enum {	_VADDR_MIG,			/* Virtual address range */
+	_PHADDR_MIG,			/* List of physical addresses */
+};
+
+
+#if defined(_NEED_STATISTICS_)
+
+/*
+ * Statistics are accessed in a non atomic way. Who cares? Just some statistics...
+ */
+STATIC struct _statistics_	_statistics;
+STATIC struct _statistics_size_	_statistics_sizes =
+					{sizeof _statistics, MAX_NUMNODES};
+
+#define	DECLARE_ITC_VAR(var)		unsigned long var
+#define	SAVE_ITC(var)			var = GET_TIMER()
+#define	STORE_DELAY(var, dest)		_statistics.t.dest += GET_TIMER() - var
+#define	COUNT(what)			_statistics.c.what++
+#define	ERROR_CNT(what)			_statistics.e.what++
+#define	ERROR_CNT_ADD(var, delta)	_statistics.e.var += delta
+#define	MOVED(from, to)			_statistics.count[from][to]++
+
+STATIC INLINE int	page_migrate_statistics(const caddr_t, const int);
+
+#else
+
+#define	DECLARE_ITC_VAR(var)
+#define	SAVE_ITC(var)			do { } while (0)
+#define	STORE_DELAY(var, dest)		do { } while (0)
+#define	COUNT(what)			do { } while (0)
+#define	ERROR_CNT(what)			do { } while (0)
+#define	ERROR_CNT_ADD(var, delta)	do { } while (0)
+#define	MOVED(from, to)			do { } while (0)
+
+#endif	/* #if defined(_NEED_STATISTICS_) */
+
+
+/*
+ * Migrate pages from a NUMA node to another (and some other minor services).
+ * (See "Documentation/migrate.txt" and "page_migrate.h".)
+ *
+ * As usual, "-Exxx" returned on errors.
+ */
+asmlinkage return_t
+sys_page_migrate(const int cmd, const caddr_t address, const size_t length,
+						const int node, const pid_t pid)
+{
+	return_t	rc;
+	DECLARE_ITC_VAR(time);		/* Total time for "sys_page_migrate()" */
+
+	SAVE_ITC(/* out */ time);
+	PRINT("\nsys_page_migrate(%d, 0x%p, 0x%lx, %d, %d): pid = %d\n",
+				cmd, address, length, node, pid, current->pid);
+	switch (cmd){
+	/*
+	 * Migrate some pages from a NUMA node to another.
+	 */
+	case _PHADDR_BATCH_MIGRATE_:
+		if (length > PAGE_SIZE / sizeof(phaddr_t)){
+			PRINT_ERR(inv_n_addresses);
+			ERROR_CNT(bad_request);
+			rc = -EINVAL;
+			break;
+		}
+		rc = batch_migrate(address, length, node, pid);
+		break;
+	/*
+	 * Migrate virtual address range.
+	 */
+	case _VA_RANGE_MIGRATE_:
+		/*
+		 * Some architectures do not decode all the MSB-s of virtual
+		 * addresses for the PGD, PMD and PTE indices, i.e. they have
+		 * got holes or aliases in the virtual address space. Make sure
+		 * that "length" does not span over virtual address holes nor
+		 * create an illegal alias to an otherwise valid address.
+		 */
+		if (__IS_VADDR_ALIAS((vaddr_t) address, length)){
+			PRINT_ERR(ill_va_alias, address, address + length);
+			ERROR_CNT(non_existent_addr);
+			rc = -EFAULT;
+			break;
+		}
+		rc = common_page_migrate(_VADDR_MIG, address, length, node, pid);
+		break;
+
+#if defined(_NEED_STATISTICS_)
+	case _STATISTICS_:
+		rc = page_migrate_statistics(address, length != 0);
+		break;
+	case _SIZEOF_STATISTICS_:
+		rc =  *(return_t *) &_statistics_sizes; /* Yeh, I know... */
+		break;
+#endif
+#if defined(_TEST_)
+	case _GIMME_AN_ADDRESS_:
+		rc = (return_t) gimme_an_address(address);
+		break;
+#endif
+	default:
+		ERROR_CNT(bad_request);
+		rc = -EINVAL;
+		break;
+	}
+	STORE_DELAY(time, /* out */ total);
+	return rc;
+}
+
+
+/*
+ * Migrate some pages identified by their physical address from a NUMA node to
+ * another.
+ *
+ * Arguments:	table:	-> the user buffer containing the physical addresses of
+ *			the pages to be migrated.
+ *			Max. "PAGE_SIZE / sizeof(phaddr_t *)" of them can be
+ *			migrated at once.
+ *		n:	Number of the physical page addresses
+ *		node:	Destination NUMA node
+ *		pid:	Pages are assumed to belong to this process
+ *
+ * Returns:	On (partial) success, some statistics are returned.
+ *		As usual, "-Exxx" returned on errors.
+ */
+STATIC INLINE return_t
+batch_migrate(const caddr_t table, const size_t n, const int node,
+								const pid_t pid)
+{
+	return_t	rc;
+	phaddr_t	*bp;
+	DECLARE_ITC_VAR(alloc_time);		/* Time for "vmalloc()" */
+
+	/*
+	 * Fetch the table of the addresses.
+	 */
+	SAVE_ITC(/* out */ alloc_time);
+	bp = vmalloc(PAGE_SIZE);
+	STORE_DELAY(alloc_time, /* out */ page_alloc);
+	if (bp == NULL){
+		PRINT_ERR(no_memory);
+		ERROR_CNT(no_memory);
+		return -ENOMEM;
+	}
+	if (copy_from_user(bp, table, n * sizeof(phaddr_t)) != 0){
+		vfree(bp);
+		PRINT_ERR(ill_user_buff);
+		ERROR_CNT(bad_request);
+		return -EFAULT;
+	}
+	rc = common_page_migrate(_PHADDR_MIG, (caddr_t) bp, n, node, pid);
+	vfree(bp);
+	return rc;
+}
+
+
+/*
+ * Look up an "mm_struct" belonging to a process ID.
+ * To migrate the memory pages of someone, we require rights similar to those
+ * necessary to kill her/him.
+ *
+ * Arguments:	pid:	ID of the victim process, "0" means myself
+ *		rcp:	-> detailed error code
+ *
+ * Returns:	On success, a pointer to the victim "mm_struct" is returned.
+ *		"NULL" is returned on failure and the "-Exxx" in "*rcp".
+ *
+ * Notes:	- On success, "->mm_users" gets incremented to make sure that
+ *		  "mm_struct" does not go away
+ *		- "->mm" of a kernel thread is "NULL"; anyway, we don't dare to
+ *		  touch a kernel thread
+ */
+STATIC INLINE struct mm_struct *
+look_up_mm(const pid_t pid, return_t * const rcp)
+{
+	struct task_struct	*p;
+	struct mm_struct	*mm;
+	DECLARE_ITC_VAR(time);			/* "mm" look up time */
+
+	SAVE_ITC(/* out */ time);
+	read_lock(&tasklist_lock);
+	if ((p = find_task_by_pid(pid)) == NULL){
+		read_unlock(&tasklist_lock);
+		STORE_DELAY(time, /* out */ mm_lookup);
+		*rcp = -ESRCH;
+		return NULL;
+	}
+	if (current->session != p->session && current->euid != p->suid &&
+			current->euid != p->uid && current->uid != p->suid &&
+				current->uid != p->uid && !capable(CAP_KILL)){
+		read_unlock(&tasklist_lock);
+		STORE_DELAY(time, /* out */ mm_lookup);
+		*rcp = -EPERM;
+		return NULL;
+	}
+	/*
+	 * "get_task_mm()" includes "task_lock()" that "nests both inside and
+	 * outside of read_lock(&tasklist_lock)" - as a note in "sched.h" states.
+	 */
+	if ((mm = get_task_mm(p)) == NULL)	/* If kernel thread... */
+		*rcp = -EPERM;
+	read_unlock(&tasklist_lock);
+	STORE_DELAY(time, /* out */ mm_lookup);
+	return mm;
+}
+
+
+/*
+ * Common page migration routine.
+ *
+ * Arguments:	type:	If _VADDR_MIG (Virtual address range):
+ *
+ *			addr:	Starting virtual address in a process's address
+ *				space
+ *			ln:	Length of the address range to be migrated
+ *
+ *			Else _PHADDR_MIG (List of physical addresses):
+ *
+ *			addr:	-> the page aligned buffer containing the
+ *				physical addresses of the pages to be migrated
+ *			ln:	Number of the physical page addresses
+ *
+ *		node:	Destination NUMA node
+ *		pid:	ID of the victim process, "0" means myself
+ *
+ * Returns:	On (partial) success, some statistics are returned.
+ *		As usual, "-Exxx" returned on errors.
+ */
+STATIC return_t
+common_page_migrate(const int type, const caddr_t addr, const size_t ln,
+						const int node, const pid_t pid)
+{
+	struct mm_struct	*mm;
+	return_t		rc;
+	struct vm_area_struct	*beg_vma;
+	DECLARE_ITC_VAR(vma_time);	/* Time for "find_vma()" */
+	DECLARE_ITC_VAR(mmap_sem);	/* Time for "down_read(&mm->mmap_sem)" */
+	DECLARE_ITC_VAR(pgd_lock);	/* "spin_lock(&mm->page_table_lock)" */
+	DECLARE_ITC_VAR(pgd_unlock);	/* "spin_unlock(&mm->page_table_lock)" */
+
+	if (!node_online(node)){
+		PRINT_ERR(dest_not_online);
+		ERROR_CNT(bad_request);
+		return -ENODEV;
+	}
+	if (pid != 0 && pid != current->pid){
+		if ((mm = look_up_mm(pid, &rc)) == NULL){
+			PRINT_ERR(illegal_pid);
+			ERROR_CNT(bad_request);
+			return rc;
+		}
+	} else {
+		mm = current->mm;
+		/*
+		 * Actually, there is no need to grab "mm" because it is ours.
+		 * As we do not want to ask questions when releasing it...
+		 * It is safe just to increment the counter.
+		 */
+		atomic_inc(&mm->mm_users);
+	}
+	SAVE_ITC(/* out */ mmap_sem);
+	down_read(&mm->mmap_sem);		/* Protect the VMA list */
+	STORE_DELAY(mmap_sem, /* out */ mmap_sem);
+	if (type == _VADDR_MIG){
+		/*
+		 * Check if the starting virtual "addr" is valid.
+		 * Some architectures do not decode all the MSB-s of virtual
+		 * addresses for the PGD, PMD and PTE indices, i.e. they have
+		 * got holes or aliases in the virtual address space.
+		 * Make sure that illegal aliases (to valid virtual addresses)
+		 * are rejected.
+		 */
+		SAVE_ITC(/* out */ vma_time);
+		beg_vma = find_vma(mm, (vaddr_t) addr);
+		STORE_DELAY(vma_time, /* out */ find_vma);
+		if (beg_vma == NULL || beg_vma->vm_start > (vaddr_t) addr){
+			up_read(&mm->mmap_sem);
+			mmput(mm);
+			PRINT_ERR(no_vma, (vaddr_t) addr);
+			ERROR_CNT(non_existent_addr);
+			return -EFAULT;
+		}
+	}
+	/*
+	 * We need the page table lock to synchronize with "kswapd"
+	 * and the SMP-safe atomic PTE updates.
+	 */
+	SAVE_ITC(/* out */ pgd_lock);
+	spin_lock(&mm->page_table_lock);
+	STORE_DELAY(pgd_lock, /* out */ pgd_lock);
+	/*
+	 * Look up the pages in the PGD and migrate them one by one.
+	 * (No harm if page-masking the address of the buffer holding the
+	 *  physical addresses of the pages.)
+	 */
+	rc = validate_migrate_pages(type, (vaddr_t) addr & PAGE_MASK,
+								ln, node, mm);
+	/*
+	 * Let the others complete the page fault handler code. They will find
+	 * the condition "someone has already installed the PTE" to be TRUE.
+	 */
+	SAVE_ITC(/* out */ pgd_unlock);
+	spin_unlock(&mm->page_table_lock);
+	STORE_DELAY(pgd_unlock, /* out */ pgd_unlock);
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+	return rc;
+}
+
+
+/*
+ * Validate and migrate pages belonging to a PGD. We check the user pages only.
+ *
+ * Arguments:	type:	If _VADDR_MIG (Virtual address range):
+ *
+ *			addr:	Starting virtual address in a process's address
+ *				space
+ *			ln:	Length of the address range to be migrated
+ *
+ *			Else _PHADDR_MIG (List of physical addresses):
+ *
+ *			addr:	-> the page aligned buffer containing the
+ *				physical addresses of the pages to be migrated
+ *			ln:	Number of the physical page addresses
+ *
+ *		node:	Destination NUMA node
+ *		mm:	-> victim "mm_struct"
+ *
+ * Returns:	On (partial) success, the number of the pages actually migrated
+ *		is returned (in form of "struct _un_success_count_").
+ *		As usual, "-Exxx" returned on errors
+ *
+ * Notes:	- Caller has to hold "mm->mmap_sem" for read and
+ *		  "mm->page_table_lock".
+ *		- For migrating a range of virtual addresses, we've already
+ *		  checked that it is safe to start walking the PGD, the PMD and
+ *		  the PTE at "addr". We've also checked that "[addr...ulimit)"
+ *		  does not span over virtual address range holes nor it creates
+ *		  an illegal alias to an otherwise valid address.
+ */
+STATIC INLINE return_t
+validate_migrate_pages(const int type, const vaddr_t addr, const vaddr_t ln,
+				const int node, struct mm_struct * const mm)
+{
+	vaddr_t			vaddr;
+	vaddr_t			ulimit; 
+	unsigned long		g, m, e;	/* PGD, PMD and PTE indices */
+	const pgd_t		*pgd;
+	const pmd_t		*pmd;
+	pte_t			*pte, *pte0;
+	int			rc;
+	struct _un_success_count_	count = {0, 0};
+	DECLARE_ITC_VAR(pgd_scan_t);		/* PGD scan time */
+
+	if (type == _PHADDR_MIG){
+		vaddr = 0;
+		ulimit = TASK_SIZE;		/* No limit */
+	} else {
+		vaddr = addr;
+		ulimit = PAGE_ALIGN((vaddr_t) addr + ln);	/* Round up */
+	}
+	g = pgd_index(vaddr);		/* PGD scan starts here */
+	m = pmd_index(vaddr);		/* The 1st PMD scan starts here */
+	e = pte_index(vaddr);		/* The 1st PTE scan starts here */
+	SAVE_ITC(/* out */ pgd_scan_t);
+	for (pgd = mm->pgd + g; vaddr < ulimit && g < USER_PTRS_PER_PGD;
+					/* Next PMD scan starts at index 0 */
+							m = 0, g++, pgd++){
+		PRINT_PGD("address: 0x%016lx pgd: 0x%p ", vaddr, pgd);
+		PRINT_PGD("g: 0x%lx m: 0x%lx e: 0x%lx\n", g, m, e);
+		PRINT_PGD("__VA():\t 0x%016lx\n", __VA(g, m, e));
+		/*
+		 * Migration tolerates holes in the virtual address space.
+		 */
+		if (pgd_none(*pgd) || pgd_bad(*pgd)){
+			vaddr &= ~(PTRS_PER_PMD * PTRS_PER_PTE * PAGE_SIZE - 1);
+			vaddr += PTRS_PER_PMD * PTRS_PER_PTE * PAGE_SIZE;
+			continue;
+		}
+		for (pmd = pmd_offset(pgd, 0) + m;
+					m < PTRS_PER_PMD && vaddr < ulimit;
+					/* Next PTE scan starts at index 0 */
+							e = 0, m++, pmd++){
+			if (pmd_none(*pmd) || pmd_bad(*pmd)){
+				vaddr &= ~(PTRS_PER_PTE * PAGE_SIZE - 1);
+				vaddr += PTRS_PER_PTE * PAGE_SIZE;
+				continue;
+			}
+			pte0 = pte_offset_map(pmd, 0);
+			for (pte = pte0 + e; e < PTRS_PER_PTE && vaddr < ulimit;
+						vaddr += PAGE_SIZE, e++, pte++){
+				if (!pte_present(*pte))
+					continue;
+				PRINT("\nVirtual addr:\t0x%016lx\n",
+								__VA(g, m, e));
+				STORE_DELAY(pgd_scan_t, /* out */ pgd_scan);
+				rc = migr_1_page_by_pte(type, (phaddr_t *) addr,
+							ln, node, mm, pte);
+				if (rc < 0){
+					pte_unmap(pte0);
+					return rc;
+				}
+				SAVE_ITC(/* out */ pgd_scan_t);
+				switch (rc){
+				case 2:	/* "*pte" does not match against the */
+					/* table of the physical addresses */
+					continue;
+				case 1:
+					count.successful++;
+					break;
+				default:
+					count.failed++;
+				}
+			}
+			pte_unmap(pte0);
+		}
+	}
+	STORE_DELAY(pgd_scan_t, /* out */ pgd_scan);
+	return *(return_t *) &count;		/* Yeh, I know... */
+}
+
+
+/*
+ * Check a PTE against the list of the physical addresses.
+ *
+ * Arguments:	addr:	-> the page aligned buffer containing the
+ *			physical addresses of the pages to be migrated
+ *		ln:	Number of the physical page addresses
+ *		pte:	PTE of the page to be moved
+ *
+ * Returns:	1:	PTE does not match
+ *		0:	A physical address corresponds to the PTE
+ *
+ * Note:	"mm->page_table_lock" and "mm->mmap_sem" have to be held.
+ */
+STATIC INLINE int
+check_phys_address(const phaddr_t *addr, size_t ln, const pte_t pte)
+{
+	for (; ln > 0; ln--, addr++)
+		if ((pte_val(pte) & _PFN_MASK) == (*addr & _PFN_MASK))
+			return 0;
+	return 1;
+}
+
+
+/*
+ * Migrate a page identified by its PTE.
+ *
+ * Arguments:	type:	If _VADDR_MIG (Virtual address range):
+ *
+ *			addr:	*do not care*
+ *			ln:	*do not care*
+ *
+ *			Else _PHADDR_MIG (List of physical addresses):
+ *
+ *			addr:	-> the page aligned buffer containing the
+ *				physical addresses of the pages to be migrated
+ *			ln:	Number of the physical page addresses
+ *
+ *		node:	Destination NUMA node
+ *		mm:	-> victim "mm_struct"
+ *		pte:	-> PTE of the page to be moved
+ *
+ * Returns:	1:	Success
+ *		2:	PTE does not match against the table of the
+ *			physical addresses
+ *		0:	We cannot cope with this page (it is valid, though)
+ *		-Exxx:	Fatal errors
+ *
+ * Note:	"mm->page_table_lock" and "mm->mmap_sem" have to be held.
+ */
+STATIC INLINE int
+migr_1_page_by_pte(const int type, const phaddr_t * const addr, const size_t ln,
+		const int node, struct mm_struct * const mm, pte_t * const pte)
+{
+	const phaddr_t	old_addr = pte_val(*pte) & _PFN_MASK;
+	const int	src_node = paddr_to_nid(old_addr);
+	struct page	* const old_p = pfn_to_page(old_addr >> PAGE_SHIFT);
+	struct page	*new_p;
+	int		rc;
+	DECLARE_ITC_VAR(alloc_time);	/* Time for "vmalloc()" */
+	DECLARE_ITC_VAR(lock_time);	/* Time for "lock_page()" */
+	DECLARE_ITC_VAR(unlock_time);	/* Time for "unlock_page()" */
+	DECLARE_ITC_VAR(free_time);	/* Time for "__free_pages()" and */
+					/* "page_cache_release()" */
+
+	if (type == _PHADDR_MIG && check_phys_address(addr, ln, *pte))
+		return 2;
+	if (node == src_node){
+		PRINT_ETC("Old ph addr:\t0x%016llx old node: %d new node: %d\n",
+						old_addr, src_node, node);
+		return 1;
+	}
+	SAVE_ITC(/* out */ alloc_time);
+	new_p = alloc_pages_node(node, GFP_HIGHUSER | __GFP_NORETRY, 0);
+	STORE_DELAY(alloc_time, /* out */ page_alloc);
+	if (new_p == NULL){
+		PRINT_ERR("No more memory on node %d\n", node);
+		ERROR_CNT(no_memory);
+		return -ENOMEM;
+	}
+	/*
+	 * Make sure the old page is not set free while we hold its lock.
+	 */
+	get_page(old_p);
+	/*
+	 * TODO: should "lock_page()" take too much time, try-lock instead.
+	 */
+	SAVE_ITC(/* out */ lock_time);
+	lock_page(old_p);
+	STORE_DELAY(lock_time, /* out */ page_lock);
+
+	rc = check_migr_1_page(old_p, new_p, mm, pte);
+
+	SAVE_ITC(/* out */ unlock_time);
+	unlock_page(old_p);
+	STORE_DELAY(unlock_time, /* out */ page_unlock);
+	PRINT("check_migr_1_page() returned: %d\n", rc);
+	if (rc == 0)
+		MOVED(src_node, node);
+	else{
+		SAVE_ITC(/* out */ free_time);
+		__free_pages(new_p, 0);
+		STORE_DELAY(free_time, /* out */ page_free);
+	}
+	SAVE_ITC(/* out */ free_time);
+	put_page(old_p);
+	STORE_DELAY(free_time, /* out */ page_free);
+	return rc == 0 ? 1 : 0;
+}
+
+
+/*
+ * The real page migration is done here.
+ *
+ * Arguments:	old:	-> old page structure
+ *		new:	-> new page structure
+ *		node:	Destination NUMA node
+ *		pte:	-> PTE of the page to be moved
+ *
+ * Returns:	Negative values (like -Exxx) indicate errors
+ *
+ * Notes:	- Both the old and the new pages and their "pte_chain"-s have
+ *		  to be locked
+ *		- "mm->page_table_lock" and "mm->mmap_sem" have to be held
+ */
+STATIC INLINE int
+page_migrate(struct page * const old, struct page * const new,
+				struct mm_struct * const mm, pte_t * const pte_p)
+{
+	struct vm_area_struct	*vma;
+	pte_t			pte;
+	vaddr_t			vaddress;
+	DECLARE_ITC_VAR(vma_time);	/* Time for "find_vma()" */
+	DECLARE_ITC_VAR(flush_tlb_t);	/* Time for "flush_tlb_page()" */
+	DECLARE_ITC_VAR(add_lru_time);	/* Time for "lru_cache_add_active()" */
+	DECLARE_ITC_VAR(copy_time);	/* Time for "copy_user_highpage()" */
+	DECLARE_ITC_VAR(upd_mmu_cache);	/* Time for "update_mmu_cache()" */
+
+	if (!PageDirect(old)){
+		PRINT_ERR("Direct mapped pages only\n");
+		ERROR_CNT(page_type_not_supp);
+		return -EFAULT;
+	}
+	vaddress = ptep_to_address(pte_p);
+	PRINT("Virtual addr:\t0x%lx\n", vaddress);
+	SAVE_ITC(/* out */ vma_time);
+	vma = find_vma(mm, vaddress);
+	STORE_DELAY(vma_time, /* out */ find_vma);
+	if (vma == NULL || vma->vm_start > vaddress)
+		panic("\nVMA lost ???\n");
+	/*
+	 * Nuke the page table entry.
+	 */
+	flush_cache_page(vma, vaddress);
+	pte = ptep_get_and_clear(pte_p);
+	SAVE_ITC(/* out */ flush_tlb_t);
+	flush_tlb_page(vma, vaddress);
+	STORE_DELAY(flush_tlb_t, /* out */ flush_tlb);
+	/*
+	 * From now on, the other CPUs cannot touch the content of the page.
+	 * Should they try to, they would observe page faults.
+	 * They pass "->mmap_sem" easily because they take it for read, too.
+	 * They queue up in the page fault handler to take "->page_table_lock".
+	 */
+	PRINT("Old ph addr:\t0x%016lx\n", page_to_phys(old));
+	PRINT("Old PTE:\t0x%016lx\n", pte_val(pte));
+	PRINT("_PFN_MASK:\t0x%016lx\n", _PFN_MASK);
+	/*
+	 * Copy some of the page structure.
+	 */
+	new->flags = (new->flags & ~FLAG_MASK) | (old->flags & FLAG_MASK);
+	new->pte.direct = old->pte.direct;
+	SetPageDirect(new);			/* Direct mapped pages only */
+	old->pte.direct = NULL;
+	ClearPageDirect(old);
+	if (PagePrivate(new))
+		new->private = old->private;
+	SAVE_ITC(/* out */ add_lru_time);
+	lru_cache_add_active(new);
+	STORE_DELAY(add_lru_time, /* out */ add_lru);
+	/*
+	 * Here is where the data is copied.
+	 */
+	SAVE_ITC(/* out */ copy_time);
+	copy_user_highpage(new, old, vaddress);
+	STORE_DELAY(copy_time, /* out */ copy);
+	/*
+	 * The new PTE keeps everything but the PFN.
+	 */
+	pte = mk_pte(new, __pgprot((pte_val(pte) & ~_PFN_MASK)));
+	PRINT("New ph addr:\t0x%016lx\nNew PTE:\t0x%016lx\n\n",
+						page_to_phys(new), pte_val(pte));
+	set_pte(pte_p, pte);
+	SAVE_ITC(/* out */ upd_mmu_cache);
+	update_mmu_cache(vma, vaddress, pte);
+	STORE_DELAY(upd_mmu_cache, /* out */ update_mmu_cache);
+	/*
+	 * The old page was "lru_cache_add_active()"-ed e.g. in
+	 * "do_anonymous_page()".
+	 */
+	page_cache_release(old);
+	return 0;
+}
+
+
+/*
+ * Some more tests and go on with the page migration.
+ *
+ * Arguments:	old:		-> old page structure
+ *		new:		-> new page structure
+ *		node:		Destination NUMA node
+ *		pte:		-> PTE of the page to be moved
+ *
+ * Returns:	Negative values indicate errors
+ *
+ * Notes:	- The old page has to be locked
+ *		- "mm->page_table_lock" and "mm->mmap_sem" have to be held
+ */
+STATIC INLINE int
+check_migr_1_page(struct page * const old, struct page * const new,
+				struct mm_struct * const mm, pte_t * const pte)
+{
+	int rc;
+	DECLARE_ITC_VAR(pte_chain_lock_time);	/* Time for "pte_chain_lock()" */
+	DECLARE_ITC_VAR(unlock_time);		/* Time for "unlock_page()" */
+
+	if (PageReserved(old)){
+		PRINT_ERR("What shall I do with a reserved page ?\n");
+		ERROR_CNT(page_type_not_supp);
+		return -ENXIO;
+	}
+	if (PageError(old)){
+		PRINT_ERR("Page has got error(s)\n");
+		ERROR_CNT(errors);
+		return -EIO;
+	}
+	if (!PageUptodate(old)){
+		PRINT_ERR("Page has no valid data ???\n");
+/*		return -EIO;
+ */
+	}
+	if (PageCompound(old)){
+		PRINT_ERR("What shall I do with a compound page ?\n");
+		ERROR_CNT(page_type_not_supp);
+		return -ENXIO;
+	}
+	if (old->mapping != NULL){
+		PRINT_ERR("Anonymous pages only\n");
+		ERROR_CNT(page_type_not_supp);
+		return -ENXIO;
+	}
+	if (PageSwapCache(old)){
+		PRINT_ERR("What shall I do with a page in swap cache ?\n");
+		ERROR_CNT(page_type_not_supp);
+		return -ENXIO;
+	}
+	if (PageHighMem(old)){
+		PRINT_ERR("What shall I do with a HIGHMEM page ?\n");
+		ERROR_CNT(page_type_not_supp);
+		return -ENXIO;
+	}
+	SAVE_ITC(/* out */ pte_chain_lock_time);
+	pte_chain_lock(old);
+	STORE_DELAY(pte_chain_lock_time, /* out */ pte_chain_lock);
+	if (!page_mapped(old)){			/* Actually means "r-mapped" */
+		PRINT_ERR("Page not in r-map\n");
+		pte_chain_unlock(old);
+		ERROR_CNT(page_type_not_supp);
+		return -EFAULT;
+	}
+	lock_page(new);
+	pte_chain_lock(new);
+
+	rc = page_migrate(old, new, mm, pte);
+
+	pte_chain_unlock(new);
+	SAVE_ITC(/* out */ unlock_time);
+	unlock_page(new);
+	STORE_DELAY(unlock_time, /* out */ new_page_unlock);
+	pte_chain_unlock(old);
+	return rc;
+}
+
+
+#if defined(_NEED_STATISTICS_)
+
+/*
+ * Fetch and clear the statistics.
+ * Accessed in a non atomic way. Who cares? Just some statistics :-)
+ */
+STATIC INLINE int
+page_migrate_statistics(const caddr_t vaddress, const int flag)
+{
+	/*
+	 * Assuming all the CPU-s are clocked at the same frequency.
+	 */
+	_statistics.t.cyc_per_usec = local_cpu_data->cyc_per_usec;
+	if (copy_to_user(vaddress, &_statistics, sizeof _statistics) != 0)
+		return -EFAULT;
+	if (flag)
+		memset(&_statistics, 0, sizeof _statistics);
+	return 0;
+}
+
+#endif	/* #if defined(_NEED_STATISTICS_) */
+
+
+#if defined(_TEST_)
+
+/*
+ * Translate a user mode virtual address to a physical one.
+ */
+phaddr_t
+gimme_an_address(const caddr_t vaddress)
+{
+	const struct vm_area_struct	*vma;
+	const pgd_t			*pgd;
+	const pmd_t			*pmd;
+	const pte_t			*pte;
+	phaddr_t			phaddress = -EFAULT;
+
+	PRINT("Virtual addr:\t0x%016lx\n", (vaddr_t) vaddress);
+	down_read(&current->mm->mmap_sem);
+	vma = find_vma(current->mm, (vaddr_t) vaddress);
+	if (vma == NULL || vma->vm_start > (vaddr_t) vaddress){
+		up_read(&current->mm->mmap_sem);
+		return -EFAULT;
+	}
+	spin_lock(&current->mm->page_table_lock);
+	do {
+		pgd = pgd_offset(current->mm, (vaddr_t) vaddress);
+		if (pgd_none(*pgd) || pgd_bad(*pgd))
+			break;
+		pmd = pmd_offset(pgd, (vaddr_t) vaddress);
+		if (pmd_none(*pmd) || pmd_bad(*pmd))
+			break;
+		pte = pte_offset_map(pmd, (vaddr_t) vaddress);
+		if (!pte_present(*pte)){
+			pte_unmap(pte);
+			break;
+		}
+		phaddress = pte_pfn(*pte) << PAGE_SHIFT;
+		pte_unmap(pte);
+	} while (0);
+	spin_unlock(&current->mm->page_table_lock);
+	up_read(&current->mm->mmap_sem);
+	PRINT("Physical addr:\t0x%016llx\n", (long long) phaddress);
+	return phaddress;
+}
+
+#endif /* #if defined(_TEST_) */
+

