Re: [RFC] Better MCA recovery on IPF

From: Matthias Fouquet-Lapar <mfl_at_kernel.paris.sgi.com>
Date: 2003-11-08 18:36:07
> > I can estimate what the procedure includes, such as changing 
> > poisoned memory to uncacheable, clearing suspect data in cache, and storing 
> > zeros to the poisoned area.
> 
> There is no way to tell if the error is soft/transient
> and can be cleared by that sequence, or hard/permanent.

I think there is. Depending on your chipset you can re-read the memory
uncached after all outstanding references have terminated. If you don't
get the same error, it is transient. 

Since I would expect the majority of errors to be transient, I think
this really is the right approach. Again, depending on the chipset architecture,
you might want to do some uncached writes/reads ("micro-diagnostics") to
confirm the nature of the problem.
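
To make the sequence concrete, here is a rough user-space sketch of the
re-read plus micro-diagnostics idea. Everything in it is an assumption for
illustration: the read_uc/write_uc hooks, the test patterns and the simulated
memory cell are hypothetical and do not correspond to any real chipset or
kernel interface. On real hardware the hooks would be chipset-specific
uncached accesses, and the caller would first have to make sure all
outstanding references have terminated.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum err_kind { ERR_TRANSIENT, ERR_HARD };

/* Hypothetical chipset hooks: uncached accesses that return false when the
 * access itself reports a memory error. Not a real kernel API. */
struct mem_ops {
    bool (*read_uc)(void *ctx, uint64_t addr, uint64_t *val);
    bool (*write_uc)(void *ctx, uint64_t addr, uint64_t val);
    void *ctx;
};

/* Caller must ensure all outstanding references to addr have terminated. */
enum err_kind classify_error(const struct mem_ops *ops, uint64_t addr)
{
    static const uint64_t patterns[] = {
        0xFFFFFFFFFFFFFFFFULL, 0x5555555555555555ULL,
        0xAAAAAAAAAAAAAAAAULL, 0x0ULL /* last: leaves the location zeroed */
    };
    uint64_t val;

    /* Step 1: uncached re-read. The same error again means not transient. */
    if (!ops->read_uc(ops->ctx, addr, &val))
        return ERR_HARD;

    /* Step 2: micro-diagnostics, i.e. uncached write/read of test patterns. */
    for (size_t i = 0; i < sizeof(patterns) / sizeof(patterns[0]); i++) {
        if (!ops->write_uc(ops->ctx, addr, patterns[i]))
            return ERR_HARD;
        if (!ops->read_uc(ops->ctx, addr, &val) || val != patterns[i])
            return ERR_HARD;
    }
    return ERR_TRANSIENT;
}

/* Simulated one-word memory so the sketch can run in user space. */
struct sim_cell {
    uint64_t value;
    uint64_t stuck_mask;  /* bits that ignore writes (simulated hard fault) */
    uint64_t stuck_bits;  /* the values those stuck bits always read as */
    bool always_errors;   /* every read reports an error */
};

static bool sim_read(void *ctx, uint64_t addr, uint64_t *val)
{
    struct sim_cell *c = ctx;
    (void)addr;
    if (c->always_errors)
        return false;
    *val = (c->value & ~c->stuck_mask) | (c->stuck_bits & c->stuck_mask);
    return true;
}

static bool sim_write(void *ctx, uint64_t addr, uint64_t val)
{
    struct sim_cell *c = ctx;
    (void)addr;
    c->value = val;
    return true;
}

struct mem_ops sim_ops(struct sim_cell *c)
{
    struct mem_ops ops = { sim_read, sim_write, c };
    return ops;
}
```

Note that the pattern test destroys the contents of the location. That is
acceptable here only because the data was already poisoned and would be
overwritten with zeros anyway.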

I used similar approaches on other architectures when figuring out whether
a single-bit error (SBE) was transient or hard. The goal was to stop triggering
on SBEs once you know that you have a hard SBE, because of the large overhead.
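
As a rough illustration of that last point, the bookkeeping could be as
simple as a small table of locations already known to be hard SBEs, consulted
before doing any expensive handling. This is only a sketch under my own
assumptions: the table size, the names sbe_mark_hard/sbe_should_handle and
the per-address granularity are all hypothetical. Real hardware may only let
you mask SBE reporting per bank or globally.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SBE_SLOTS 8  /* assumed size of the known-hard table */

struct sbe_filter {
    uint64_t hard_addr[SBE_SLOTS];  /* addresses classified as hard SBEs */
    size_t   count;                 /* number of valid entries */
};

static bool sbe_is_known_hard(const struct sbe_filter *f, uint64_t addr)
{
    for (size_t i = 0; i < f->count; i++)
        if (f->hard_addr[i] == addr)
            return true;
    return false;
}

/* Record addr as a known hard SBE. Returns false if the table is full. */
bool sbe_mark_hard(struct sbe_filter *f, uint64_t addr)
{
    if (sbe_is_known_hard(f, addr))
        return true;
    if (f->count == SBE_SLOTS)
        return false;
    f->hard_addr[f->count++] = addr;
    return true;
}

/* True if an SBE at addr still deserves full handling and diagnosis. */
bool sbe_should_handle(const struct sbe_filter *f, uint64_t addr)
{
    return !sbe_is_known_hard(f, addr);
}
```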

> The safest option is to simply take the page with
> the error out of service and not re-use it.

One problem might be that you now lose a page of main memory, and it might
require an additional TLB entry if you use large memory segments.

- Matthias

> 
> -Tony
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ia64" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Received on Sat Nov 8 02:39:09 2003
