Re: [RFC/PATCH] pfn_valid() more generic : intro[0/2]

From: Hiroyuki KAMEZAWA <>
Date: 2004-10-06 17:33:52

Luck, Tony wrote:
>>ia64's ia64_pfn_valid() uses get_user() for checking whether a 
>>page struct is available or not. I think this is an irregular 
>>implementation and following patches
>>are a more generic replacement, careful_pfn_valid(). It uses 2 
>>level table.
> It is odd ... but a somewhat convenient way to check whether
> the page struct exists, while handling the fault if it is in an
> area of virtual mem_map that doesn't exist.  I think that in practice
> we rarely call it with a pfn that generates a fault (except in error
> paths).

I understand it's a rare case.
Honestly, this patch is for the no-bitmap buddy allocator (which I posted before).
pfn_valid() returns 0 in many cases with the no-bitmap buddy allocator
(because MAX_ORDER is 4GB).
So I decided to write an experimental pfn_valid() which doesn't cause a fault.

> How big will the pfn_validmap[] be for a very sparse physical space
> like SGI Altix?  I'm not sure I see how PFN_VALID_MAPSHIFT is 
> generated for each system.
PFN_VALID_MAPSHIFT can be overridden in each asm-xxx/page.h (or in config.h).
I think each special architecture can find a suitable value if it wants.
If Altix has XXX Tbytes per node, a setting where 1 cache line (64 bytes = 32 entries)
covers each node's maximum size would be good.

1st level table.
With the current configuration, one 2-byte entry covers 1 Gbyte, so 1 page (16 KB pages) covers 8 Tbytes.
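
The arithmetic behind these figures can be checked with a small sketch. The
macro and function names below are hypothetical and do not come from the
patch; only the numbers (2-byte entries, 1 Gbyte per entry, 16 KB pages,
64-byte cache lines) are taken from this mail.

```c
#include <stdint.h>

/* Figures quoted in this mail; all names are hypothetical. */
#define SKETCH_ENTRY_BYTES   2ULL            /* size of one 1st-level entry */
#define SKETCH_ENTRY_COVERS  (1ULL << 30)    /* 1 Gbyte covered per entry */
#define SKETCH_PAGE_BYTES    (16ULL << 10)   /* 16 KB page */
#define SKETCH_CACHELINE     64ULL           /* bytes per cache line */

/* Physical memory covered by `bytes` bytes of the 1st-level table. */
static uint64_t sketch_l1_coverage(uint64_t bytes)
{
    return (bytes / SKETCH_ENTRY_BYTES) * SKETCH_ENTRY_COVERS;
}
```

One page holds 16384 / 2 = 8192 entries, so sketch_l1_coverage(SKETCH_PAGE_BYTES)
is 8 Tbytes. One cache line holds 32 entries, so with the same 1 Gbyte
granularity it covers 32 Gbytes.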

2nd level table.
Each entry is 8 bytes. Adjacent entries are coalesced as much as possible.
If the memory layout is like a bee's nest, careful_pfn_valid() will need a great amount
of memory and will not work well because of the search cost.
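
As a rough illustration of the coalescing, here is a minimal sketch that
merges sorted ranges which touch or overlap. It assumes an 8-byte entry made
of two 32-bit pfns describing a half-open range [start, end); that layout and
every name here are assumptions for illustration, not the patch's actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed 8-byte entry: half-open pfn range [start, end). */
struct sketch_range {
    uint32_t start;
    uint32_t end;
};

/*
 * Merge ranges in place. The input must be sorted by start pfn.
 * Returns the number of entries left after merging.
 */
static size_t sketch_coalesce(struct sketch_range *r, size_t n)
{
    size_t out = 0;
    size_t i;

    if (n == 0)
        return 0;

    for (i = 1; i < n; i++) {
        if (r[i].start <= r[out].end) {
            /* Touches or overlaps the current entry: extend it. */
            if (r[i].end > r[out].end)
                r[out].end = r[i].end;
        } else {
            /* A hole separates them: keep as a new entry. */
            r[++out] = r[i];
        }
    }
    return out + 1;
}
```

Every hole in the layout leaves one more entry that cannot be merged, which
is why a heavily fragmented layout needs more memory and a longer search.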

BTW, how sparse is SGI Altix?

> Why do we need a loop when looking in the 2nd level?  Can't the
> entry from the 1st level point us to the right place?
Consider this case.

a 1st level entry covers 0x1000 - 0x2000
[valid range          ]  0x1000 - 0x1100
                          0x1200 - 0x1500
                          0x1600 - 0x2000

             -> by 1st level, we get 0x1000-0x1100
                              into loop  0x1200-0x1500
                                         0x1600-       returns 0.

Walking the 2nd level table lets us reduce the size of the 1st level table.
I'd rather avoid a cache miss than avoid a small walk.
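
The case above can be sketched as follows. This is only an illustration
under stated assumptions, not the code in the patch: the shift value, the
8-byte entry layout (two 32-bit pfns, half-open range), the "no memory"
marker and all sketch_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shift for this example only: one 1st-level entry
 * covers 0x1000 pfns, matching the 0x1000 - 0x2000 case above. */
#define SKETCH_MAPSHIFT 12
#define SKETCH_L1_NONE  0xffffu   /* marker: no valid range in this chunk */

/* Assumed 8-byte 2nd-level entry: half-open pfn range [start, end). */
struct sketch_range {
    uint32_t start;
    uint32_t end;
};

/* 2nd level: sorted, non-overlapping, already coalesced ranges. */
static const struct sketch_range sketch_l2[] = {
    { 0x1000, 0x1100 },
    { 0x1200, 0x1500 },
    { 0x1600, 0x2000 },
};
#define SKETCH_L2_COUNT (sizeof(sketch_l2) / sizeof(sketch_l2[0]))

/* 1st level: 2-byte index of the first 2nd-level entry for each chunk. */
static const uint16_t sketch_l1[] = {
    SKETCH_L1_NONE,  /* pfn 0x0000 - 0x0fff: no memory */
    0,               /* pfn 0x1000 - 0x1fff: walk starts at sketch_l2[0] */
};
#define SKETCH_L1_COUNT (sizeof(sketch_l1) / sizeof(sketch_l1[0]))

static int sketch_pfn_valid(uint32_t pfn)
{
    size_t chunk = pfn >> SKETCH_MAPSHIFT;
    size_t i;

    if (chunk >= SKETCH_L1_COUNT || sketch_l1[chunk] == SKETCH_L1_NONE)
        return 0;

    /* Short walk: stop as soon as a range starts beyond pfn. */
    for (i = sketch_l1[chunk]; i < SKETCH_L2_COUNT; i++) {
        if (pfn < sketch_l2[i].start)
            return 0;   /* pfn falls in a hole before this range */
        if (pfn < sketch_l2[i].end)
            return 1;   /* pfn lies inside this range */
    }
    return 0;
}
```

For a pfn in the hole between 0x1500 and 0x1600, the walk starts at
0x1000-0x1100, moves to 0x1200-0x1500, then sees that 0x1600 starts beyond
the pfn and returns 0. One 1st-level entry serves all three ranges, which
keeps the 1st level small.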

- Kame

Received on Wed Oct 6 03:29:25 2004
