Re: [Linux-ia64] Re: [Discontig-devel] CLUMPS, CHUNKS and GRANULES

From: Jack Steiner <steiner@sgi.com>
Date: 2002-08-20 07:34:00
> 
> Hi Jack,
> 
> thanks very much for the detailed comments. There was (at least) one
> mistake in my previous mail, sorry if that generated confusion. As far
> as I understand, you agree with getting rid of CHUNKS and replacing
> the one macro using them with GRANULE. Also I hope we can limit the
> usage of PXM to be only within arch/ia64/acpi.c.

Agree on both points.


> 
> On Friday 16 August 2002 23:53, Jack Steiner wrote:
> > > things and get rid of some concepts. In my opinion we need only
> > > the following concepts inside DISCONTIGMEM:
> > >  - node IDs (AKA compact node IDs or logical nodes).
> > >  - physical node IDs
> > >  - clumps (I'd prefer the name memory BANKS here, as a clump suggests
> > >  something to be contiguous, without holes (German: Klumpen)).
> > >
> > > In the initialisation phase we need:
> > >  - memory blocks (AKA chunks?) (contiguous pieces of memory on one
> > >  node, provided by ACPI, only used for setup. No size or alignment
> > >  expected. Needed later for paddr_to_nid() but that's all.)
> > >  - proximity domains (only ACPI NUMA setup, invisible to the rest of
> > >  the DISCONTIG code).
> 
> > > Therefore a node would have several memory banks which are not
> > > necessarily adjacent in the physical memory space. There can be gaps
> > > or banks from other nodes interleaved. In the mem_map array there is
> > > space reserved for page struct entries of ALL pages of one bank,
> > > whether present or not. Memory holes between banks don't create holes
> > > in the mem_map array.
> >
> > If the mem_map has entries for pages that don't exist, how do you handle
> > code that scans the mem_map array? How does code recognize and skip pages
> > associated with missing memory? For examples, see show_mem()
> > & get_discontig_info(). (Maybe I misunderstood your proposal here.)
> 
> Sorry, I didn't mean to change anything related to mem_map. I just
> described the current state incorrectly. From the following piece of code
> in discontig_paging_init() I thought that for each clump we would have
> PLAT_CLUMPSIZE/PAGE_SIZE struct page entries in the mem_map array. I
> missed the readjustment of npages below...
> 
> 
> 	mycnodeid = boot_get_local_cnodeid();
> 	for (cnodeid = 0; cnodeid < numnodes; cnodeid++) {
> 		page = NODE_DATA(cnodeid)->node_mem_map;
> 		bdp = BOOT_NODE_DATA(cnodeid)->bdata;
> 		/* one bootmem descriptor per memory block on this node */
> 		while (bdp->node_low_pfn) {
> 			kaddr = (unsigned long)__va(bdp->node_boot_start);
> 			ekaddr = (unsigned long)__va(bdp->node_low_pfn << PAGE_SHIFT);
> 			while (kaddr < ekaddr) {
> 				/* remember where this clump's mem_map entries start */
> 				node_data[mycnodeid]->clump_mem_map_base[PLAT_CLUMP_MEM_MAP_INDEX(kaddr)] = page;
> 				npages = PLAT_CLUMPSIZE/PAGE_SIZE;
> 				/* readjust: struct pages exist only up to the
> 				   end of the memory block */
> 				if (kaddr + (npages<<PAGE_SHIFT) > ekaddr)
> 					npages = (ekaddr - kaddr) >> PAGE_SHIFT;
> 				for (i = 0; i < npages; i++, page++, kaddr += PAGE_SIZE)
> 					page->virtual = (void*)kaddr;
> 			}
> 			bdp++;
> 		}
> 	}
> 
> Which means: for each memory block we get from the ACPI SRAT table we
> will have struct pages reserved, and no more. Hotpluggable (non-existent?)
> memory should appear here too, I guess, but that should do no harm.
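
(As an aside, those SRAT memory blocks are all paddr_to_nid() would need.
A sketch only -- the table and field names below are hypothetical, not
existing code:)

	/* Sketch: one entry per memory block from the ACPI SRAT. */
	struct memblk { unsigned long start, end; int nid; };
	extern struct memblk memblks[];
	extern int num_memblks;

	int paddr_to_nid(unsigned long paddr)
	{
		int i;

		for (i = 0; i < num_memblks; i++)
			if (paddr >= memblks[i].start && paddr < memblks[i].end)
				return memblks[i].nid;
		return -1;	/* address not covered by any node */
	}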
> 
> 
> 
> > On the SGI platform, "physical node number" has a very precise definition.
> > This is not true on all architectures. On SGI, the physical node number is
> > bits [48:38] of the physical address. In addition, a system can run with a
> > sparse set of physical node numbers. For example, a 3-node system could
> > have physical nodes 512, 800 & 2012.
> 
> It's somewhat similar on the NEC Azusa/Asama, where one can get the
> physical node number from the SAPIC_ID of the CPUs.

FWIW, the physical node number is in the sapic_id on SGI systems too.
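
(Purely as illustration: with the node number in bits [48:38], extracting
it is a shift and mask. The macro name below is made up, not the actual
SN2 code:)

	/* Sketch: physical node number from a physical address on SN2,
	 * bits [48:38], i.e. 11 bits -> 0..2047.  Name is hypothetical. */
	#define PADDR_TO_PNODE(paddr)	(((paddr) >> 38) & 0x7ff)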


> 
> 
> 
> > > > 	proximity domain numbers - these numbers are assigned by ACPI.
> > > > 	Each platform must provide a platform-specific function
> > > > 	for mapping proximity domain numbers to physical node numbers.
> > >
> > > The proximity domain numbers are unnecessary. They are just another
> >
> > Unfortunately, for SGI hardware, proximity domain numbers can't be the same
> > as physical node numbers. ACPI limits proximity domain numbers to 0..254.
> > On SGI, physical node numbers are 0..2047. Fortunately, we found a way to
> > compress the physical node number into a proximity domain number. In the
> > future, though, our current "trick" may no longer work. If we can get
> > the proximity domain numbers changed to 0..65K, then I
> > agree that they could be the same as the physical node numbers.
> > Is there any chance we can get this changed?
> >
> > > (compact) mapping. Only SGI uses the pxm numbers later as:
> > > #define PLAT_BOOTMEM_ALLOC_GOAL(cnode,kaddr) \
> > >   __pa(SN2_KADDR(PLAT_PXM_TO_PHYS_NODE_NUMBER(nid_to_pxm_map[cnode])
> > > but it is clear that what they actually want to do is translate the
> > > compact node id to a physical node id. They just misuse the PXM
> > > translation tables for this. All references to proximity domain numbers
> > > can be eliminated after the ACPI setup phase. Maybe we need some map
> > > when hotplugging and adjusting a physical->logical translation table,
> > > but not in DISCONTIG.
> 
> The only place where the PXM is used (after ACPI-based initialisation) is
> the macro above. And there you do
> cnode ---> PXM number ---> physical node id
> I suggest something trivial: use a physnode_map[] table initialised
> by the ACPI routines which detected the nodes. So just do
> cnode ---> physical node id
> The physnode_map table will have only numnodes entries and we will never
> see PXM outside the ACPI init routines.

Seems ok.
I assume there is still an ACPI platform-specific macro or function to convert
PXM numbers into physical node numbers.
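
Something like the following would do, I suppose (a sketch of the proposal
above only -- the table and helper names are hypothetical, not existing
code):

	/* Sketch: filled in by the ACPI SRAT parsing in arch/ia64/acpi.c,
	 * one entry per detected node. */
	extern int physnode_map[NR_NODES];	/* cnode -> physical node id */

	#define cnode_to_physnode(cnode)	(physnode_map[cnode])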


> 
> 
> 
> > > > - Memory is conceptually divided into chunks. A chunk is either
> > > >   completely present, or else the kernel assumes it is completely
> > > >   absent. Each node consists of a number of possibly discontiguous
> > > > chunks.
> > >
> > > When reading the code I get the impression that the concept of a CHUNK
> > > isn't really needed in the code. The definitions are misleading
> > > because they suggest that CHUNKS are equally sized (there is a
> > > CHUNKSHIFT) and we should expect ACPI to give us a bunch of
> > > chunks. But all we really need them for is to check whether a
> > > physical address is valid, or to find out which node a physical
> > > address belongs to. When building the mem_map and the page struct
> > > entries we need to know whether a page is inside a valid memory block
> > > or not, no matter what this memory block looks like, how big it is,
> > > or whether it fits into one clump or not. On Azusa a chunk returned by
> > > ACPI can span the whole node's memory, so the rule "a clump is made
> > > of chunks" does not hold.
> >
> > Agree that CHUNK is barely used. I think the way GRANULE is being used, it
> > may replace the need for CHUNKs.
> 
> Fine! Then we can get rid of CHUNKS and PXMs in the mmzone*.h files?

PXM, yes.

I also think we can get rid of CHUNKs.  If we decide we need to support
kern_addr_valid(), we will need to use GRANULE the same way as we currently use
CHUNK. However, I think it is safe to wait until we understand the
requirement for kern_addr_valid() (as long as we know how to do it).
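
(If it ever comes to that, a GRANULE-based kern_addr_valid() could be as
simple as a bitmap with one bit per granule. A sketch with made-up names,
not existing code:)

	/* Sketch: one bit per GRANULE, set at boot for granules that
	 * actually contain memory. */
	extern unsigned long granule_valid_map[];

	#define kern_addr_valid(kaddr) \
		test_bit(__pa(kaddr) >> IA64_GRANULE_SHIFT, granule_valid_map)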


> 
> > However, since IA64 doesn't currently implement a kern_addr_valid() function,
> > CHUNK is not currently used.
> >
> > Do you know if kern_addr_valid() for IA64 is planned in the future?
> No idea. Do you think we need such a thing? We lived without it for quite
> a while...

No idea either...


> 
> > > > - each node has a single contiguous mem_map array. The array contains
> > > > page struct entries for every page on the node. There are no "holes" in
> > > > the mem_map array. The node data area (see below) has pointers to the
> > > > start of the mem_map entries for each clump on the node.
> > >
> > > The mem_map array is the same on each node, copied from the boot_node
> > > to all other nodes. It contains page_struct entries for ALL pages on
> > > ALL nodes (if I interpret discontig_paging_init() correctly). The
> > > first two sentences need to be reformulated.
> >
> > I think the first two sentences are correct, but the last one is
> > misleading. Is this better:
> >
> > 	- each node has a single contiguous page_struct array. This array contains
> > page struct entries for every page that is actually present on the node.
> > There are no "holes" in the page_struct array for non-existent memory. Note
> > that adjacent entries in the array are NOT necessarily for contiguous
> > physical pages if there are multiple non-contiguous clumps on the node.
> >
> > 	  The node data area (see below) has pointers to the start of the
> > page_struct entries for each clump on the node.
> 
> Agreed. I misunderstood discontig_paging_init.
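
(To make the clump indirection concrete, a lookup would go roughly like
this -- a sketch only, assuming memory blocks start on clump boundaries
and reusing the names from the code above:)

	/* Sketch: find the struct page for a kernel virtual address via
	 * the per-clump mem_map base pointers set up at boot. */
	struct page *kaddr_to_page(unsigned long kaddr)
	{
		struct page *base;

		base = node_data[boot_get_local_cnodeid()]->
			clump_mem_map_base[PLAT_CLUMP_MEM_MAP_INDEX(kaddr)];
		return base + ((kaddr & (PLAT_CLUMPSIZE - 1)) >> PAGE_SHIFT);
	}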
> 
> 
> > > > 	PLAT_BOOTMEM_ALLOC_GOAL(cnode,kaddr) - Calculate a "goal" value to be
> > > > passed to __alloc_bootmem_node for allocating structures on nodes so
> > > > that they don't alias to the same line in the cache as the previously
> > > > allocated structure. You can return 0 if your platform doesn't have this
> > > > problem.
> > I'm not real happy with this solution, but I think it works. To verify it,
> > I added a printk right after the point in discontig.c that does the
> > allocate:
> >
> >
> > Looks ok, although the node 0 allocation is not necessarily ideal.
> 
> OK, looks reasonable. Will think of something similar for DIG64
> platforms.
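
For DIG64 it might be enough to simply stagger the goal per node, e.g.
(a sketch only -- the stride is arbitrary and this is not existing code,
though the macro name matches the one described above):

	/* Sketch: push each node's allocations past the cache lines of
	 * the previously allocated structure (kaddr) by a per-node
	 * stride.  Purely hypothetical. */
	#define PLAT_BOOTMEM_ALLOC_GOAL(cnode, kaddr) \
		(__pa(kaddr) + ((cnode) + 1) * SMP_CACHE_BYTES)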
> 
> 
> 
> > Consistency in naming is what is important. We should all agree on the
> > terminology & variable naming conventions. We also need to be clear that
> > the maximum node number is NOT the same as NR_NODES-1.
> >
> > If I understand your proposal,
> >
> > 	logical nodes are:
> > 		values: 0..NR_NODES-1.
> > 		names are (pick one) node, nodenum, lnode, cnode, ...
> >
> > 	physical nodes:
> > 		values: 0..???
> > 		names: pnode, physnode, ....
> >
> > Lets pick the names we want to use.
> 
> OK. I like node & physnode (the latter is pretty rare, I guess).
> 
> > > > 	PLAT_MAX_NODE_NUMBER - maximum physical node number plus 1
> > >
> > > And this one should be MAX_PHYS_NODES or NR_PHYS_NODES.
> >
> > These names are confusing. For example, the SGI SN2 system has:
> > 	maximum number of nodes: 128
> > 	maximum node number: 2047
> >
> > According to the current discontig patch for SN2:
> > 	PLAT_MAX_NODE_NUMBER = 2048
> > 	PLAT_MAX_COMPACT_NODES = 128
> >
> > (Note: I don't object to changing names, but we need both abstractions).
> 
> I understand your objection. For the first macro we need to remind
> readers that it is one more than the highest possible physical node ID
> in the system. PLAT_MAX_PHYSNODE_NUMBER sounds too long, so maybe:
>     MAX_PHYSNODE_ID = 2048   ?
> Anyway, this one is only used for the physical_node_map[] which maps
> a physnode to a cnode ID, so maybe we shouldn't worry too much about the
> name.
> And instead of PLAT_MAX_COMPACT_NODES I'd prefer
>      NR_NODES     = 128
> or
>      MAX_NUMNODES = 128,
> whatever is used on other architectures.

OK.
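
So, with the SN2 numbers from above, the two mappings would end up
something like this (a sketch using the names just agreed on, not
existing code):

	/* physnode -> cnode: sparse, sized by the largest physical node
	 * id plus one (2048 on SN2). */
	int physical_node_map[MAX_PHYSNODE_ID];

	/* cnode -> physnode: dense, one entry per existing node
	 * (NR_NODES = 128 on SN2). */
	int physnode_map[NR_NODES];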


> 
> Best regards,
> Erich
> 
> 


-- 
Thanks

Jack Steiner    (651-683-5302)   (vnet 233-5302)      steiner@sgi.com