Re: [patch] Remove limit on MCA recoveries

From: Matthias Fouquet-Lapar <mfl_at_kernel.paris.sgi.com>
Date: 2005-01-16 20:01:43
> On Sat, 15 Jan 2005 16:49:06 -0600 (CST), 
> Russ Anderson <rja@sgi.com> wrote:
> >The MCA recovery driver saves addresses of memory errors
> >in an array.  The array has 32 entries.  The effect is 
> >that after 32 recoveries, the driver stops recovering.
> >
> >This patch removes the page_isolate array.  Since the array
> >was only used to see if the page is already marked reserved,
> >check the reserved bit instead of the array.
> 
> lkcd dumps kernel pages marked reserved, so lkcd will try to process
> isolated pages.  We will eventually need to add a new page flag to mark
> faulty pages.

Any other dump mechanism probably needs to be aware of bad HW pages as well,
so we might be better off adding a flag right away. While we are at it, I
would propose having two flags:

  - hard error (touching the page will cause an MCA, so it should be skipped
                when taking a system dump)
  - soft error (page has encountered an SBE, so we might want to avoid future
                allocation, but it can be dumped without causing an MCA)
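
To make the two-flag idea concrete, here is a rough user-space sketch. All
names (PG_HWERR_HARD, PG_HWERR_SOFT, struct page_info and the helpers) are
hypothetical and only stand in for whatever page flags and struct page
accessors we would actually add; nothing like this exists in the tree today.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits; these are NOT existing kernel page flags. */
enum {
    PG_HWERR_HARD = 1u << 0,  /* touching the page raises an MCA */
    PG_HWERR_SOFT = 1u << 1   /* page has seen an SBE, still readable */
};

/* Stand-in for the kernel's struct page (assumption for this sketch). */
struct page_info {
    unsigned long pfn;
    unsigned int flags;
};

/* A dump tool may read the page unless it has a hard error. */
static int page_is_dumpable(const struct page_info *p)
{
    assert(p != NULL);
    return !(p->flags & PG_HWERR_HARD);
}

/* The allocator should avoid pages with either kind of error. */
static int page_is_allocatable(const struct page_info *p)
{
    assert(p != NULL);
    return !(p->flags & (PG_HWERR_HARD | PG_HWERR_SOFT));
}

/* Count the pages a dump would copy, skipping hard-error pages. */
static size_t count_dumpable(const struct page_info *pages, size_t n)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++)
        if (page_is_dumpable(&pages[i]))
            count++;
    return count;
}
```

The point of keeping the two bits separate is that a soft-error page is
still safe to dump but should not be handed out again, while a hard-error
page must be skipped by both the dumper and the allocator.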

Then we need to define some API so that this information can be saved across
a reboot, so we don't have to have another process hitting the UCE to find
out that there really is a problem at that location.
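
One possible shape for such an API, sketched in user-space C purely for
discussion: a list of (physical address, error kind) records that is written
out before shutdown and read back at boot. The record layout, the text
format and the function names are all assumptions; where the data would
actually live (firmware, NVRAM, a file) is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical error kinds, matching the two proposed flags. */
enum bad_page_kind { BAD_PAGE_SOFT = 0, BAD_PAGE_HARD = 1 };

/* Hypothetical persistent record for one faulty page. */
struct bad_page_record {
    unsigned long long paddr;   /* physical address of the page */
    enum bad_page_kind kind;
};

/* Write one "hex-address kind" line per record; returns 0 on success. */
static int save_bad_pages(FILE *out, const struct bad_page_record *recs,
                          size_t n)
{
    size_t i;

    assert(out != NULL);
    for (i = 0; i < n; i++)
        if (fprintf(out, "%llx %d\n", recs[i].paddr, (int)recs[i].kind) < 0)
            return -1;
    return 0;
}

/* Read back up to max records; returns the number actually loaded. */
static size_t load_bad_pages(FILE *in, struct bad_page_record *recs,
                             size_t max)
{
    size_t n = 0;
    unsigned long long paddr;
    int kind;

    assert(in != NULL);
    while (n < max && fscanf(in, "%llx %d", &paddr, &kind) == 2) {
        if (kind != BAD_PAGE_SOFT && kind != BAD_PAGE_HARD)
            break;              /* stop at the first malformed record */
        recs[n].paddr = paddr;
        recs[n].kind = (enum bad_page_kind)kind;
        n++;
    }
    return n;
}
```

At boot, the loaded records would be used to mark the corresponding pages
before anything is allowed to touch them.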

Thanks

Matthias Fouquet-Lapar  Core Platform Software    mfl@sgi.com  VNET 521-8213
Principal Engineer      Silicon Graphics          Home Office (+33) 1 3047 4127

Received on Sun Jan 16 04:14:23 2005
