Re: Preserving CMC/CPE records across reboot

From: Jack Steiner <steiner_at_sgi.com>
Date: 2006-01-14 02:57:02

On Fri, Jan 13, 2006 at 11:46:29AM +1100, Keith Owens wrote:
> CMC/CPE records (unlike MCA/INIT) are copied into kernel space and
> cleared from NVRAM as soon as they occur.  That decision was made by
> Bjorn Helgaas some years ago.  The idea is that if you do not have
> salinfo_decode or some equivalent program running then the correctable
> errors still need to be deleted from NVRAM.  But if the system hangs
> while reading the CMC/CPE then we get no data at all.
> 
> SGI just had an example of this.  A cpu took a CMC, salinfo_decode
> started running and hung while processing the CMC record, the system
> had to be rebooted.  Because the CMC record had been cleared from NVRAM
> before handing a copy to salinfo_decode, the contents were lost.

On SN, CMC/CPE records are never written to NVRAM. They are saved only
in memory. If the system hangs while logging a CMC/CPE and is then reset,
all CMC/CPE records are lost.

Some of this could probably be changed, but that is how it currently
works. Also, writing error records to NVRAM is slow, which is something
to avoid on performance-critical paths. I suppose we could threshold the
error rate to limit the rate of writing to NVRAM.
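
As a rough illustration of the thresholding idea, here is a minimal
sketch in C of a fixed-window limiter that caps how many records may be
written to NVRAM per time window. Everything in it is an assumption made
for illustration: the struct, the function names and the use of seconds
as the time unit are hypothetical and do not correspond to any existing
kernel or SAL interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limiter state; none of these names exist in the kernel. */
struct nvram_write_limiter {
	uint64_t window_start;	/* start of the current window, in seconds */
	uint32_t window_len;	/* window length, in seconds */
	uint32_t max_writes;	/* NVRAM writes allowed per window */
	uint32_t writes;	/* writes already done in this window */
};

static void limiter_init(struct nvram_write_limiter *l,
			 uint32_t window_len, uint32_t max_writes)
{
	assert(window_len > 0);
	l->window_start = 0;
	l->window_len = window_len;
	l->max_writes = max_writes;
	l->writes = 0;
}

/*
 * Return true if a record arriving at time `now` may be written to
 * NVRAM, false if it should stay in memory only because the threshold
 * for the current window has been reached.
 */
static bool limiter_allow(struct nvram_write_limiter *l, uint64_t now)
{
	/* Open a fresh window once the current one has expired. */
	if (now - l->window_start >= l->window_len) {
		l->window_start = now;
		l->writes = 0;
	}
	if (l->writes >= l->max_writes)
		return false;
	l->writes++;
	return true;
}
```

With a 10 second window and a limit of 2, the first two records in a
window would go to NVRAM and the rest would be kept in memory only,
until the window rolls over. A real implementation would also need
locking or per-cpu state, which this sketch leaves out.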

> 
> We should be able to keep the first few CMC/CPE records for each cpu in
> NVRAM and discard the later ones if we start getting a backlog.  Then
> if the system hangs while processing a CMC/CPE, the data will still be
> available in NVRAM and will be processed on the next boot.  If the
> reboot hangs again in salinfo processing then we have a solid error,
> either cpu or SAL, so switch the offending cpu out of the system.
> 
> Any objections from other platforms?
> 
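
The "keep the first few records per cpu" proposal quoted above could be
sketched roughly as follows. This is only an illustration under stated
assumptions: MAX_CPUS, KEEP_PER_CPU and both function names are
hypothetical, and the sketch only tracks slot counts; it does not touch
NVRAM or SAL at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CPUS	4	/* hypothetical, kept small for illustration */
#define KEEP_PER_CPU	2	/* hypothetical "first few" records per cpu */

struct record_slots {
	/* Number of records currently retained in NVRAM for each cpu. */
	unsigned int kept[MAX_CPUS];
};

/*
 * Decide whether a new CMC/CPE record for `cpu` should be retained in
 * NVRAM. Once a cpu already has KEEP_PER_CPU unprocessed records, later
 * ones are discarded from NVRAM (the backlog case).
 */
static bool keep_in_nvram(struct record_slots *s, unsigned int cpu)
{
	assert(s != NULL);
	if (cpu >= MAX_CPUS)
		return false;
	if (s->kept[cpu] >= KEEP_PER_CPU)
		return false;
	s->kept[cpu]++;
	return true;
}

/* Free a slot once user space has finished processing a record. */
static void record_processed(struct record_slots *s, unsigned int cpu)
{
	assert(s != NULL);
	if (cpu < MAX_CPUS && s->kept[cpu] > 0)
		s->kept[cpu]--;
}
```

The point of the design is that the earliest records, which are the
ones most likely to explain a subsequent hang, survive a reset, while a
storm of later errors cannot fill NVRAM.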
Received on Sat Jan 14 02:58:07 2006
