Yet another MCA handler

From: Zoltan Menyhart <Zoltan.Menyhart_AT_bull.net_at_nospam.org>
Date: 2004-01-14 21:42:35
This is the season of the MCA handlers :-)
Let me show you the one that Christian Cotte-Barrot and I wrote...

I'd like to take this opportunity to express our special thanks to Jenna
Hall, who gave us the initial version of the ".S" code and much help,
and also to Mani Ayyar, David Song and Tony Luck for the technical
consultations.

Our handler currently deals with the translation register errors only.
I was going to write the recovery code for poisoned memory, too,
but I've got no way to provoke this kind of error
( I do not really know what it is like :-) )

The key features of our MCA handler are:

* Everything is CPU local ( an MCA data area is allocated and hooked
  to each "cpuinfo" structure )

* No locks

* No rendezvous
 - Does not seem to work unless all the CPUs are started up,
   e.g. if you specify a "maxcpus=<NUM>"...
 - A failed rendezvous is a bad omen to start with
 - The correctable / recoverable MCAs are CPU-local business
 - All the CPUs can handle MCAs simultaneously

* The translation registers are purged / reloaded unconditionally:
  cheaper than calling SAL_GET_STATE_INFO(MCA)

* Table driven TR purging / reload (except for the kernel stack mapping)

* TRs are all purged before the reloading starts ( an erroneous TR can still
  be in conflict with a freshly purged / reloaded one )

* SAL_CLEAR_STATE_INFO(MCA) is called only for MCAs which have been
  corrected (TR errors). For the others, the recovery will be attempted by
  a fake page fault handler, by the device drivers and by the MCA daemon,
  therefore the SAL MCA log is not cleared here -- future extension :-)

* "Silent" MCA handler: no prints by default ( unless debugging )
  - Output uses locks...

* A bit more serious error / status checking
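
To make the "CPU local", "no locks" and "purge all, then reload all"
points above more concrete, here is a minimal user-space sketch in C.
It is not the code from the patch. The real handler does this work in
".S" code on the actual translation registers; here the registers are
simulated by a plain array. All the names ( "mca_area", "tr_desc",
"tr_slot", "mca_area_of", "mca_reload_trs", the SKETCH_* constants )
and the structure layouts are assumptions made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 4  /* hypothetical CPU count */
#define SKETCH_NR_TRS  3  /* hypothetical number of table-driven TRs */

/* What one pinned translation should contain (hypothetical layout). */
struct tr_desc {
    uint64_t vaddr;  /* virtual address covered by the TR */
    uint64_t pte;    /* translation to insert */
};

/* Stand-in for one hardware translation register. */
struct tr_slot {
    bool     valid;
    uint64_t vaddr;
    uint64_t pte;
};

/* Per-CPU MCA area: the handler touches only its own CPU's copy,
 * so it needs neither locks nor a rendezvous. */
struct mca_area {
    struct tr_desc table[SKETCH_NR_TRS]; /* known-good contents */
    struct tr_slot tr[SKETCH_NR_TRS];    /* simulated registers */
};

static struct mca_area mca_areas[SKETCH_NR_CPUS];

/* Returns the per-CPU area, or NULL for an out-of-range CPU number. */
struct mca_area *mca_area_of(unsigned int cpu)
{
    return cpu < SKETCH_NR_CPUS ? &mca_areas[cpu] : NULL;
}

/* Purge every TR first, then reload every TR from the table.
 * Returns the number of TRs reloaded, or -1 for a bad CPU number. */
int mca_reload_trs(unsigned int cpu)
{
    struct mca_area *a = mca_area_of(cpu);
    size_t i;

    if (a == NULL)
        return -1;

    /* Pass 1: unconditional purge; no attempt to find the bad TR. */
    for (i = 0; i < SKETCH_NR_TRS; i++)
        a->tr[i].valid = false;

    /* Pass 2: reload only after all the stale entries are gone. */
    for (i = 0; i < SKETCH_NR_TRS; i++) {
        a->tr[i].vaddr = a->table[i].vaddr;
        a->tr[i].pte   = a->table[i].pte;
        a->tr[i].valid = true;
    }
    return SKETCH_NR_TRS;
}
```

The two separate loops are the point of the sketch: no entry is
reloaded while a possibly erroneous one is still valid. The kernel
stack mapping, which is not table driven, is not modelled here.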

This patch is against the version 2.6.1 + kdb-v4.3-2.6.1-common-b0.bz2 +
kdb-v4.3-2.6.1-ia64-b0.bz2.

Testing:
- Obviously by use of an ITP
- In my next mail I'll include a patch that can insert an illegal
  translation in a TR provoking an MCA

Problems:
Neither "IA64_LOG_NEXT_BUFFER()" nor "salinfo_log_wakeup()" works :-(
I think some addresses are messed up. The system says it cannot
translate virtual address...

I'll send the patch in the next letter.
Should the list refuse it due to its length, please pick it up at our
anonymous FTP server: ftp://visibull.frec.bull.fr/pub/linux/mca/ 

Your remarks will be appreciated.

Zoltan Menyhart
Received on Wed Jan 14 05:42:43 2004
