Re: new utility for decoding salinfo records

From: Matthias Fouquet-Lapar <mfl_at_kernel.paris.sgi.com>
Date: 2005-01-12 09:26:06
>   Matthias> systems which have several tera-bytes of main memory, but
>   Matthias> as I said this really is very platform specific
> 
> Probably so.  Still, I think a very interesting systems paper could be
> written that would spell out at least the basic trends/invariants in
> memory error behavior.  Hint, hint... ;-)

I'm actually working on such a paper. The real challenge, as you already
pointed out, is collecting longer-term data. I hope to have something
ready in the summer time frame, as it simply takes time to run experiments.
Some testing can be done in environmental stress test chambers, but then
the total sample size is lower. One tool I'm currently looking at would
try predictive error analysis based on the data collected by salinfo.
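
As an illustration only, one simple form of predictive analysis is to flag
a DIMM once its corrected errors cluster within a short time span. The
sketch below assumes the salinfo records have already been decoded into a
simplified event list; struct cme_event, max_errors_in_window() and
dimm_is_suspect() are hypothetical names for this example and do not
reflect the real record layout or any existing tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified event: one corrected memory error that has
 * already been decoded from a salinfo record. */
struct cme_event {
    uint64_t time_sec;  /* time the error was logged, in seconds */
    unsigned dimm;      /* index of the DIMM that reported it */
};

/* Largest number of events for `dimm` whose timestamps all lie less than
 * `window_sec` seconds apart. `ev` must be sorted by time_sec. */
static size_t max_errors_in_window(const struct cme_event *ev, size_t n,
                                   unsigned dimm, uint64_t window_sec)
{
    size_t best = 0, count = 0, lo = 0;

    assert(window_sec > 0);
    for (size_t hi = 0; hi < n; hi++) {
        if (ev[hi].dimm != dimm)
            continue;
        count++;
        /* Slide the left edge forward: skip other DIMMs, and drop our
         * own events once they are too old for the current window. */
        while (lo < hi) {
            if (ev[lo].dimm != dimm) {
                lo++;
                continue;
            }
            if (ev[hi].time_sec - ev[lo].time_sec < window_sec)
                break;
            count--;
            lo++;
        }
        if (count > best)
            best = count;
    }
    return best;
}

/* Assumed policy: a DIMM is suspect once `threshold` or more corrected
 * errors fall inside one window. The threshold is a tuning knob. */
static bool dimm_is_suspect(const struct cme_event *ev, size_t n,
                            unsigned dimm, uint64_t window_sec,
                            size_t threshold)
{
    return max_errors_in_window(ev, n, dimm, window_sec) >= threshold;
}
```

Suitable window and threshold values are platform specific and would have
to come from exactly the kind of longer-term data discussed above.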

Another idea I want to explore is to allow a signal to be sent to the
process (which isn't straightforward ...). This would obviously only be
interesting for on-line diagnostics. It would allow the diagnostic to
focus on a failing location and see whether an error is repeatable,
whether it's data dependent, etc. Maybe this feedback mechanism can help
to develop better testing strategies.
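
To make the retest step concrete, here is a minimal sketch of how a
diagnostic might classify a suspect location once it has been told where
the failure occurred. How the notification is delivered is left open. The
probe callback, the verdict names and the simulated fault are all
assumptions made for this example; a real probe would have to write and
read the location and learn from the error-reporting path whether that
access triggered an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical probe: exercise the suspect location with `pattern` and
 * return true if an error was reported for that access. */
typedef bool (*probe_fn)(uint64_t pattern, void *ctx);

enum verdict {
    NOT_REPRODUCED,  /* no probe failed */
    INTERMITTENT,    /* some pattern failed on some trials but not all */
    DATA_DEPENDENT,  /* each pattern always or never fails, both occur */
    REPEATABLE       /* every pattern failed on every trial */
};

/* Run each pattern `trials` times and summarize the outcome. */
static enum verdict classify(probe_fn probe, void *ctx,
                             const uint64_t *patterns, size_t npat,
                             unsigned trials)
{
    size_t always = 0, never = 0;

    assert(npat > 0 && trials > 0);
    for (size_t p = 0; p < npat; p++) {
        unsigned fails = 0;

        for (unsigned t = 0; t < trials; t++)
            if (probe(patterns[p], ctx))
                fails++;
        if (fails == trials)
            always++;
        else if (fails == 0)
            never++;
    }
    if (never == npat)
        return NOT_REPRODUCED;
    if (always == npat)
        return REPEATABLE;
    if (always + never == npat)
        return DATA_DEPENDENT;
    return INTERMITTENT;
}

/* Simulated fault for demonstration only: any pattern that sets a bit in
 * `stuck_mask` is treated as triggering an error. */
struct sim_fault {
    uint64_t stuck_mask;
};

static bool sim_probe(uint64_t pattern, void *ctx)
{
    const struct sim_fault *f = ctx;

    return (pattern & f->stuck_mask) != 0;
}
```

With the simulated fault set to a single stuck bit, a pattern list that
mixes values with and without that bit set is classified as
DATA_DEPENDENT, while a list in which every pattern sets the bit is
classified as REPEATABLE.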

(I actually have a test system which has known problem DIMMs)


Thanks

Matthias Fouquet-Lapar  Core Platform Software    mfl@sgi.com  VNET 521-8213
Principal Engineer      Silicon Graphics          Home Office (+33) 1 3047 4127

Received on Tue Jan 11 17:42:02 2005