[Linux-ia64] SAL error record logging/decoding

From: Bjorn Helgaas <bjorn_helgaas_at_hp.com>
Date: 2003-05-08 09:41:08
The MCA/INIT/CMC/CPE log decoding currently in arch/ia64/kernel/mca.c
has some problems:

	- It doesn't know much about OEM-specific sections.
	- At boot time, printing the log to the console
	  sometimes takes so long that the BSP erroneously
	  assumes an AP is stuck.  This sometimes causes
	  *another* MCA.
	- The log goes ONLY to the console, where the output
	  may be lost.

So here's some fodder for discussion.  I don't claim that this is ready
for prime time; I just want to get some feedback on whether this
is a reasonable approach.

The attached patch (against 2.4.21-rc1) makes the raw, binary
error records straight from SAL available via files in /proc:

	/proc/sal/cpu<n>/{mca,init,cmc,cpe}

If you read the file, you get the raw data.  If you write "clear" to
it, you invalidate the current error record (which, as I read the spec,
may make another pending record available to be read).
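
As a rough illustration of the read/clear interface described above, here
is a minimal user-space sketch.  The helper names (sal_read_record,
sal_clear_record) are hypothetical and are not taken from the patch or
from salinfo, and the sketch assumes only what is stated above: reading
returns the raw record, and writing the string "clear" invalidates it.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to cap bytes of the raw record behind path into buf.
 * Returns the number of bytes read, or -1 on error.
 * (Hypothetical helper, not part of the patch.)
 */
ssize_t sal_read_record(const char *path, unsigned char *buf, size_t cap)
{
	size_t total = 0;
	int fd;

	assert(path != NULL && buf != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	while (total < cap) {
		ssize_t n = read(fd, buf + total, cap - total);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			close(fd);
			return -1;
		}
		if (n == 0)
			break;			/* end of record */
		total += (size_t)n;
	}
	close(fd);
	return (ssize_t)total;
}

/*
 * Invalidate the current record by writing "clear" to the file.
 * Returns 0 on success, -1 on error.
 * (Hypothetical helper, not part of the patch.)
 */
int sal_clear_record(const char *path)
{
	static const char cmd[] = "clear";
	ssize_t n;
	int fd;

	assert(path != NULL);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	n = write(fd, cmd, strlen(cmd));
	if (close(fd) < 0)
		return -1;
	return n == (ssize_t)strlen(cmd) ? 0 : -1;
}
```

A caller would pass a path such as /proc/sal/cpu0/mca together with a
buffer it considers large enough for one record; the right buffer size is
not specified here and is left to the caller.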

The idea is that

	- An rc script run at boot-time can save all the logs in
	  files, clearing each afterwards.
	- A user-level analysis tool can decode them as needed
	  (perhaps also run from the same rc script above).
	- The user-level analyzer need not be open-source, if
	  people are worried about IP in the OEM-specific sections.
	- A baseline open-source analyzer can provide at least the
	  functionality available today in the kernel decoder.
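
The save-then-clear step in the list above could look roughly like the
sketch below, written in C rather than as an rc script.  Everything in it
is an assumption for illustration: the function names, the output file
naming (cpu<n>-<type>.bin), the caller-supplied CPU count, and the choice
to treat an empty read as "no record".  It saves only one record per file
and does not loop to pick up a pending record that clearing might expose.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static const char *const sal_types[] = { "mca", "init", "cmc", "cpe" };

/* Copy src to dst and flush it to disk.  Returns bytes copied, or -1. */
static long copy_record(const char *src, const char *dst)
{
	unsigned char buf[4096];
	long total = 0;
	ssize_t n;
	int in, out;

	in = open(src, O_RDONLY);
	if (in < 0)
		return -1;
	out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (out < 0) {
		close(in);
		return -1;
	}
	while ((n = read(in, buf, sizeof(buf))) > 0) {
		if (write(out, buf, (size_t)n) != n) {
			n = -1;		/* short or failed write */
			break;
		}
		total += n;
	}
	close(in);
	if (fsync(out) < 0)
		n = -1;
	if (close(out) < 0)
		n = -1;
	if (n < 0)
		return -1;
	if (total == 0)
		unlink(dst);		/* assumed: empty read means no record */
	return total;
}

/* Write "clear" to the record file.  Returns 0 on success, -1 on error. */
static int clear_record(const char *path)
{
	int fd = open(path, O_WRONLY);
	int ok;

	if (fd < 0)
		return -1;
	ok = write(fd, "clear", 5) == 5;
	if (close(fd) < 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Save every available record under proc_root into out_dir, clearing a
 * record only after its copy is safely on disk.  Returns the number of
 * records saved and cleared, or -1 if out_dir cannot be created.
 */
int save_all_records(const char *proc_root, const char *out_dir, int ncpus)
{
	char src[512], dst[512];
	int saved = 0;
	size_t t;
	int cpu;

	assert(proc_root != NULL && out_dir != NULL);
	if (mkdir(out_dir, 0700) < 0 && errno != EEXIST)
		return -1;
	for (cpu = 0; cpu < ncpus; cpu++) {
		for (t = 0; t < sizeof(sal_types) / sizeof(sal_types[0]); t++) {
			snprintf(src, sizeof(src), "%s/cpu%d/%s",
				 proc_root, cpu, sal_types[t]);
			snprintf(dst, sizeof(dst), "%s/cpu%d-%s.bin",
				 out_dir, cpu, sal_types[t]);
			if (copy_record(src, dst) <= 0)
				continue;	/* missing or empty: do not clear */
			if (clear_record(src) == 0)
				saved++;
		}
	}
	return saved;
}
```

The ordering is the point of the sketch: the record is cleared only after
the copy has been written and flushed, so a failed save leaves the
original record in place for a later attempt.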

So, attached are the kernel patch against 2.4.21-rc1 and a simple
user program ("salinfo") to decode the logs.  Note that the kernel
patch removes the SAL clear_state_info calls from mca.c, so the error
records will be preserved until the user program can read them.
This feels like the right thing to me (only a user program
can know that the logs have been saved somewhere safe), but
no doubt there are issues here.

The user-space analyzer is derived from the current kernel code
in mca.c and should produce identical output.  For now, I left
all the code in the kernel as well, but ultimately it could be
removed.

Bjorn




Received on Wed May 07 16:41:17 2003
