Re: new utility for decoding salinfo records

From: Matthias Fouquet-Lapar <mfl_at_kernel.paris.sgi.com>
Date: 2005-01-12 08:36:52
>   Ben> We believed rates of SBEs in the neighborhood of 1/hr would
>   Ben> ultimately lead to MBEs but further testing has shown that we
>   Ben> really don't see DIMMs with SBEs turning into MBEs.
> 
> That's very interesting.

We have seen both: hard SBEs which never end up in a UCE, and bursts of
SBEs which will lead to UCEs. It is DRAM vendor specific, and it depends
on which phase of the chip's life cycle the error occurs in (infant
mortality or not).

Another important data point is whether the error is "soft", i.e. it is
corrected after a scrub operation (probably caused by an alpha particle
hit), or "hard", i.e. the error is still there after the memory location
has been re-written.
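
As a rough illustration of the soft/hard distinction, here is a minimal
sketch in Python. Everything in it is an assumption made for the example:
ToyMemory, classify_sbe and the stuck-bit model are hypothetical names and
are not part of salinfo or any kernel interface. It only shows the idea of
"re-write, read back, compare" on a simulated memory.

```python
from enum import Enum


class ErrorKind(Enum):
    SOFT = "soft"   # gone after the location is re-written (scrubbed)
    HARD = "hard"   # still present after the location is re-written


class ToyMemory:
    """Hypothetical memory model: some addresses have bits stuck at 1."""

    def __init__(self, stuck_bits=None):
        self.cells = {}
        # address -> bit mask of cells that always read back as 1
        self.stuck_bits = stuck_bits or {}

    def write(self, address, value):
        self.cells[address] = value

    def read(self, address):
        return self.cells.get(address, 0) | self.stuck_bits.get(address, 0)


def classify_sbe(memory, address, corrected_value):
    """Re-write the corrected data, read it back, and classify the error.

    Note: with a stuck-at fault the result depends on the data pattern,
    which is one reason to log the pattern along with the error.
    """
    memory.write(address, corrected_value)
    if memory.read(address) == corrected_value:
        return ErrorKind.SOFT
    return ErrorKind.HARD


mem = ToyMemory(stuck_bits={0x10: 0x4})
healthy = classify_sbe(mem, 0x20, 0xFF)   # no fault at this address
faulty = classify_sbe(mem, 0x10, 0x00)    # bit 2 is stuck at 1 here
print(healthy.value, faulty.value)
```

On real hardware the read-back would have to bypass or inspect ECC
correction, which is platform specific and not modelled here.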

I think as long as it is possible to log all errors, the following toolchain
can be adopted. Depending on the system infrastructure it might be useful
to capture additional information such as:
  
  - data pattern including ECC
  - environmental conditions (voltage, temperature)
  - DIMM serial numbers

The latter is becoming a real issue when dealing with systems which have
several terabytes of main memory, but as I said this really is very
platform specific.
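
To make the list of additional fields above concrete, here is a minimal
sketch of what one logged error record could look like. The record layout,
the field names and the JSON encoding are all assumptions for illustration;
they do not describe the salinfo record format or any existing tool.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MemoryErrorRecord:
    """Hypothetical per-error log entry; all field names are illustrative."""
    timestamp: float        # seconds since the epoch
    physical_address: int
    data_pattern: int       # data word as read
    ecc_bits: int           # ECC check bits stored with the word
    voltage_v: float        # environmental condition: supply voltage
    temperature_c: float    # environmental condition: temperature
    dimm_serial: str


def to_log_line(record):
    """Serialize a record as one JSON line so logs stay easy to grep."""
    return json.dumps(asdict(record), sort_keys=True)


example = MemoryErrorRecord(
    timestamp=0.0,
    physical_address=0x1000,
    data_pattern=0xDEADBEEF,
    ecc_bits=0x5A,
    voltage_v=2.5,
    temperature_c=45.0,
    dimm_serial="HYPOTHETICAL-0001",   # made-up serial number
)
line = to_log_line(example)
print(line)
```

Keeping the DIMM serial number in every record means errors can still be
attributed to a part after DIMMs have been moved or swapped.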


Thanks

Matthias Fouquet-Lapar  Core Platform Software    mfl@sgi.com  VNET 521-8213
Principal Engineer      Silicon Graphics          Home Office (+33) 1 3047 4127

Received on Tue Jan 11 16:51:38 2005