Re: Read *pgd again in vhpt_miss handler

From: Zoltan Menyhart <Zoltan.Menyhart_at_bull.net>
Date: 2006-04-28 17:53:19
Christoph Lameter wrote:
> On Thu, 27 Apr 2006, Zoltan Menyhart wrote:
> 
> 
>>I wanted to use the mm semaphore => no need to walk again the
>>pgd ... pte chain.
> 
> 
> The pgd ... pte chain does not change even without mmap until 
> the usage of the memory area ceases.

It is about un-mapping a zone while another thread faults
on an address belonging to the same zone.

We have got a

	rx = ... -> pgd[i] -> pud[j] -> pmd[k] -> pte[l]

chain to walk in the VHPT miss handler.

Having reached some point in this walk, we hold the physical
address of the next page in the chain in a register.

Before we can fetch the next item in the chain, an unpredictably
long time can pass.

In the meantime:
- "free_pgtables()" kills the page we are about to touch.
- Someone re-uses the same page for something else.

As we still hold the same physical address, we fetch an item
from a page that is no longer ours.

Even if this race window is small, it does exist.
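
The interleaving described above can be replayed in a small
user-space model. This is only a sketch under stated assumptions:
the two-level table, the "phys" array and the names "walk_begin",
"walk_finish" and "unmap_and_reuse" are hypothetical and are not
taken from the ia64 code. "walk_finish" also models the "read the
upper-level entry again" check from the subject line. The check
reports the change, but the load from the reused page has already
happened by then.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRIES 4            /* slots per toy page */
#define NPAGES  4            /* pages of toy "physical memory" */
#define NONE    UINT32_MAX   /* marks an empty upper-level entry */

/* Toy physical memory; page numbers stand in for physical addresses. */
static uint32_t phys[NPAGES][ENTRIES];

/* What the walker keeps "in a register" between two loads. */
struct walk {
    uint32_t up_page, up_idx;  /* where the upper-level entry lives */
    uint32_t next_page;        /* value loaded from that entry */
};

/* First load: read the upper-level entry and keep the result. */
static void walk_begin(struct walk *w, uint32_t up_page, uint32_t idx)
{
    assert(up_page < NPAGES && idx < ENTRIES);
    w->up_page = up_page;
    w->up_idx = idx;
    w->next_page = phys[up_page][idx];
}

/*
 * Second load: fetch through the saved page number, then read the
 * upper-level entry again. Returns 1 and stores the fetched value
 * if the entry is unchanged, 0 if the chain changed underneath us.
 */
static int walk_finish(const struct walk *w, uint32_t idx, uint32_t *out)
{
    uint32_t val;

    if (w->next_page == NONE)
        return 0;
    assert(w->next_page < NPAGES && idx < ENTRIES);
    val = phys[w->next_page][idx];          /* may hit a reused page */
    if (phys[w->up_page][w->up_idx] != w->next_page)
        return 0;                           /* entry changed: discard */
    *out = val;
    return 1;
}

/* The other end: clear the entry, then let the page be reused. */
static void unmap_and_reuse(uint32_t up_page, uint32_t idx, uint32_t junk)
{
    uint32_t victim = phys[up_page][idx];
    uint32_t k;

    phys[up_page][idx] = NONE;
    if (victim == NONE)
        return;
    for (k = 0; k < ENTRIES; k++)
        phys[victim][k] = junk;             /* new owner's data */
}
```

Calling "unmap_and_reuse" between "walk_begin" and "walk_finish"
reproduces the case in question: the walker still holds the old
page number and reads the new owner's data from it.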

The probability of hitting this bug is higher on a NUMA machine
with lots of CPUs.

I can accept that the VHPT miss handler cannot be protected by
locks; it is the other end that should use some "careful
un-mapping" in order to avoid race conditions.

Here is what I'm working on:

PTE, PMD and PUD page usage perfectly fits into the RCU approach:

1. The VHPT miss handler is protected by "rcu_read_lock_bh()".
   There is not a single instruction added, the required semantics
   is provided by the fact that the interrupts are off.

2. "free_pgtables()" keeps working as it does today for
   single-threaded applications.

3. "free_pgtables()" and its subroutines do not actually free
   the PTE, PMD and PUD pages for multi-threaded applications.
   These pages will be freed via a "call_rcu_bh()"-activated
   service.

(Perhaps, the weaker protection "rcu_read_lock()" - "call_rcu()"
will be enough...)
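
The three points above can be sketched as a single-threaded
user-space model. It is an illustration only, not kernel code and
not the planned patch: "handler_enter", "handler_exit",
"release_pt_page" and "reclaim" are hypothetical names, and the
"grace period" is simplified to "every CPU that was inside the
handler when the page was retired has left it". The model assumes
that the un-mapper clears the upper-level entry before retiring
the page, so walkers that start later cannot reach it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES   16   /* toy page-table pages */
#define MAXDEFER 8    /* capacity of the deferred-free list */

struct deferred {
    int page;
    unsigned waiting;  /* CPUs inside the handler at retire time */
};

static struct deferred pending[MAXDEFER];
static size_t npending;
static unsigned in_handler;      /* bit n set: CPU n is in the handler */
static bool page_freed[NPAGES];

/* Read side: stands in for "interrupts are off in the handler". */
static void handler_enter(int cpu)
{
    in_handler |= 1u << cpu;
}

static void handler_exit(int cpu)
{
    size_t i;

    in_handler &= ~(1u << cpu);
    for (i = 0; i < npending; i++)
        pending[i].waiting &= ~(1u << cpu);  /* this CPU is done */
}

/*
 * Update side. Single-threaded users free at once (point 2);
 * multi-threaded users only queue the page (point 3).
 * Returns 0 on success, -1 if the deferred list is full.
 */
static int release_pt_page(int page, bool multi_threaded)
{
    assert(page >= 0 && page < NPAGES);
    if (!multi_threaded) {
        page_freed[page] = true;
        return 0;
    }
    if (npending == MAXDEFER)
        return -1;
    pending[npending].page = page;
    pending[npending].waiting = in_handler;
    npending++;
    return 0;
}

/* Batch service: free every queued page nobody can still hold. */
static size_t reclaim(void)
{
    size_t i, kept = 0, freed = 0;

    for (i = 0; i < npending; i++) {
        if (pending[i].waiting == 0) {
            page_freed[pending[i].page] = true;
            freed++;
        } else {
            pending[kept++] = pending[i];
        }
    }
    npending = kept;
    return freed;
}
```

In this model a page retired while CPU 0 is inside the handler
stays allocated across any number of "reclaim()" calls, and is
freed by the first "reclaim()" after "handler_exit(0)".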

Please note that:
- The life span of the PTE, PMD and PUD pages is rather long:
  they are freed when the usage of the memory area ceases,
  provided no other map (using the same PTE, PMD and PUD pages)
  is valid.
- The number of PTE, PMD and PUD pages is much smaller
  than that of the leaf pages.
Therefore freeing them is not really performance critical.
As the "call_rcu_bh()"-activated freeing service will do batch
processing, there is a chance that freeing the PTE, PMD and PUD
pages in this way will be more efficient than today's
"pte_free()" etc. services.

Regards,

Zoltan

Received on Fri Apr 28 17:54:21 2006
