[PATCH 2.6.13 0/6] MCA/INIT: summary

From: Keith Owens <kaos_at_sgi.com>
Date: 2005-09-11 17:15:33

The patches in the following mails are a rewrite of the MCA/INIT
handlers.  They are ready for inclusion in 2.6.14-rc1.

Changes since last complete spin:

* Remove the requirement that kernel stacks be aligned on KERNEL_STACK_SIZE.
* Remove the serialization of MCA/INIT handlers returning to SAL.  The
  problem looked like a race but was really caused by a broken prom
  doing cacheable accesses to the minstate area.
* Print the cpu number and monarch status in the INIT handler.
* Workaround for broken proms that access the minstate area using
  cacheable addresses.
* Remove the export of the scheduler hooks until we have modular code
  that needs them.
* Remove the final reference to MINSTATE_VIRT.
* Workaround for broken proms that drive all INIT events as slaves.
* Workaround for broken proms that drive all INIT events as monarchs.
* Simplify the termination of the backtrace of the MCA/INIT handlers.

Some background might be useful.  The current MCA/INIT handlers have
several shortcomings :-

(1) Only one MCA stack, so we cannot handle concurrent MCA on multiple
    cpus.

(2) Only one INIT stack, for the monarch.  Slave INIT events never get
    into the C code, which gives no data for the slave processes.

(3) The lack of slave INIT processing also means that some MCA events
    that could normally be recovered may turn into fatal events.  If
    one or more cpus are spinning disabled when an MCA occurs then SAL
    will eventually hit the disabled cpus with a slave INIT event.
    Even if the MCA is recoverable (e.g. DBE in user space), the cpus
    that were hit by INIT are now dead, which makes MCA recovery
    pointless.

(4) A monarch INIT event assumes that it can use the existing stack.
    If the INIT was delivered while the cpu was in physical mode then
    the OS monarch handler gets a recursive error.  Ditto if the kernel
    stack has overflowed.

(5) MCA and INIT stacks are completely non-standard.  You cannot get a
    backtrace nor debug the MCA/INIT handlers.  We even have a special
    entry point in the unwind code just for MCA/INIT.  Only the kernel
    knows about that unwind routine; external code such as libunwind
    does not.

(6) The current code relies on getting data from the MCA/INIT record.
    If we hang trying to retrieve that record then we get no useful
    data.  A side effect of using the MCA/INIT record is that we may
    read a stale record from an earlier event if it was not cleared
    before a second event occurs.

(7) Some horrible assembler code in minstate.h, to handle both the
    normal stacks and the non-standard MCA/INIT stacks.

(8) Only one copy of the SAL to OS state, which prevents multiple cpus
    from returning to SAL.

My MCA/INIT rewrite addresses these problems by :-

(1) Using per cpu MCA stacks.

(2) Using per cpu INIT stacks.

(3) Using a common code path for both monarch and slave INIT events,
    passing in a flag to indicate if the event is monarch or slave.

(4) Neither MCA nor INIT will use any part of the current stack until
    they have verified that it is safe to do so.

(5) MCA/INIT stacks look like normal process stacks.  I can even get a
    backtrace through the MCA/INIT handlers :).  This removes the need
    for the special unwind routine.

(6) All data is obtained from PAL/SAL data areas.  There is no need to
    call SAL to get the record, and the problem of stale data goes
    away.

(7) minstate.h is now all virtual mode code.

(8) Each cpu gets its own copy of the SAL to OS state.
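
To make points (1), (2), (3) and (8) concrete, here is a minimal user
space sketch of the layout, not the code from the patches.  Every name
and size in it (sketch_cpu_area, sketch_sos, sketch_init_handler,
SKETCH_NR_CPUS, SKETCH_STACK_SIZE) is hypothetical and chosen only for
illustration.  It shows each cpu owning its own MCA stack, INIT stack
and SAL to OS state, with one INIT entry point that takes a monarch
flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_CPUS    4      /* hypothetical cpu count for this sketch */
#define SKETCH_STACK_SIZE 16384  /* hypothetical per-event stack size */

/* Hypothetical stand-in for the SAL to OS state saved on handler entry. */
struct sketch_sos {
    unsigned long sal_ra;  /* where to return to in SAL */
    bool monarch;          /* true if this cpu is the monarch for the event */
};

/* One private area per cpu, so events on different cpus share no storage. */
struct sketch_cpu_area {
    unsigned char mca_stack[SKETCH_STACK_SIZE];
    unsigned char init_stack[SKETCH_STACK_SIZE];
    struct sketch_sos mca_sos;
    struct sketch_sos init_sos;
    int init_events;       /* number of INIT events seen on this cpu */
};

static struct sketch_cpu_area sketch_areas[SKETCH_NR_CPUS];

/*
 * Common INIT path: monarch and slave events run the same code and
 * differ only in the flag that is passed in.
 */
static struct sketch_sos *sketch_init_handler(unsigned int cpu,
                                              unsigned long sal_ra,
                                              bool monarch)
{
    assert(cpu < SKETCH_NR_CPUS);

    struct sketch_cpu_area *area = &sketch_areas[cpu];

    area->init_sos.sal_ra = sal_ra;
    area->init_sos.monarch = monarch;
    area->init_events++;
    return &area->init_sos;
}
```

Calling sketch_init_handler(0, ra0, true) and then
sketch_init_handler(1, ra1, false) returns two distinct state blocks,
so both cpus can return to SAL without overwriting each other's data.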

The original plan was to treat an MCA/INIT as an interrupt that
switched stacks, even if a cpu was already using a kernel stack.
However that caused problems with the notion of "current", mainly
because the task structure is stored in the stack area.  Separating the
task structure from the rest of the stack was vetoed on performance
grounds because it would require extra TLB entries.  This plan would
also have required changes to unwinders, both in the kernel and in
external packages such as lcrash.

Plan B involves switching to the MCA/INIT stacks, making them look like
normal processes with no dependency on data in other stacks.  The
process that was running at the time of MCA/INIT is converted to look
like a sleeping task, complete with its state at the time of interrupt.
The MCA/INIT stack has a pointer to the interrupted task; in addition
the pid of the interrupted task is placed in the 'comm' field of the
MCA/INIT process for humans to read.  This approach does not require
extra TLBs and it works with the existing unwind code.  The only
downside is that it requires two small hooks in the scheduler code to
adjust the scheduler's notion of "this process is on this cpu".
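
The bookkeeping in Plan B can be sketched in user space as follows.
This is an illustration under stated assumptions, not the patch
itself: planb_task, planb_curr, planb_enter_handler and
planb_leave_handler are hypothetical names, planb_curr stands in for
the scheduler's notion of which task is on a cpu, and the "%s %d"
layout of the comm field is an assumption (the text above only says
that the pid is placed in comm).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PLANB_NR_CPUS  4   /* hypothetical cpu count for this sketch */
#define PLANB_COMM_LEN 16  /* hypothetical size of the comm field */

enum planb_state { PLANB_RUNNING, PLANB_SLEEPING };

struct planb_task {
    int pid;
    char comm[PLANB_COMM_LEN];
    enum planb_state state;
    struct planb_task *interrupted;  /* only set on MCA/INIT handler tasks */
};

/* Hypothetical stand-in for "this process is on this cpu". */
static struct planb_task *planb_curr[PLANB_NR_CPUS];

/* Switch the cpu to the handler task; returns the interrupted task. */
static struct planb_task *planb_enter_handler(unsigned int cpu,
                                              struct planb_task *handler,
                                              const char *event)
{
    assert(cpu < PLANB_NR_CPUS);
    /* Event name, a space, up to 10 digits and the NUL fit in 16 bytes. */
    assert(strlen(event) <= 4);

    struct planb_task *prev = planb_curr[cpu];

    if (prev)
        prev->state = PLANB_SLEEPING;  /* looks like a sleeping task */
    handler->interrupted = prev;
    /* Put the interrupted pid in comm for humans to read. */
    snprintf(handler->comm, sizeof(handler->comm), "%s %d",
             event, prev ? prev->pid : 0);
    handler->state = PLANB_RUNNING;
    planb_curr[cpu] = handler;         /* first scheduler hook */
    return prev;
}

/* Undo the switch when the event has been handled. */
static void planb_leave_handler(unsigned int cpu, struct planb_task *handler)
{
    assert(cpu < PLANB_NR_CPUS);

    struct planb_task *prev = handler->interrupted;

    if (prev)
        prev->state = PLANB_RUNNING;
    planb_curr[cpu] = prev;            /* second scheduler hook */
    handler->interrupted = NULL;
}
```

With pid 1234 running on a cpu, entering the handler for an "INIT"
event leaves the handler task current with comm set to "INIT 1234",
and the interrupted task reachable through the interrupted pointer.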

The following 6 patches contain :-

1) Scheduler hooks to change which process is deemed to be on a cpu.

2) Add an extra thread_info flag to indicate the special MCA/INIT
   stacks.  Mainly for debuggers.

3) Avoid reading the INIT record from SAL during the INIT event.  Just
   tell salinfo.c that a new record is available; it will be read and
   processed in a normal context.

4) The bulk of the change.  Use per cpu MCA/INIT stacks.  Change the
   SAL to OS state (sos) to be per process.  Do all the assembler work
   on the MCA/INIT stacks, leaving the original stack alone.  Pass per
   cpu state data to the C handlers for MCA and INIT, which also means
   changing the mca_drv interfaces slightly.  Lots of verification on
   whether the original stack is usable before converting it to a
   sleeping process.

5) Remove the physical mode path from minstate.h.

6) Delete the special case unwind code that was only used by the old
   MCA/INIT handler.
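
The idea behind patch 3 is a plain "flag now, read later" pattern.  A
minimal user space sketch follows, assuming C11 atomics; the names
defer_record_pending, defer_note_record and defer_process_records are
hypothetical and are not the salinfo.c interface.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEFER_NR_CPUS 4  /* hypothetical cpu count for this sketch */

/* One "new record available" flag per cpu. */
static atomic_bool defer_record_pending[DEFER_NR_CPUS];

/*
 * Called from the INIT handler.  It only sets a flag; it makes no SAL
 * call, so it cannot hang while retrieving the record.
 */
static void defer_note_record(unsigned int cpu)
{
    assert(cpu < DEFER_NR_CPUS);
    atomic_store(&defer_record_pending[cpu], true);
}

/*
 * Called later from a normal context.  Returns the number of cpus
 * whose record was pending; real code would read and process each
 * record where the comment is.
 */
static int defer_process_records(void)
{
    int handled = 0;

    for (unsigned int cpu = 0; cpu < DEFER_NR_CPUS; cpu++) {
        /* Test and clear in one step so a record is handled once. */
        if (atomic_exchange(&defer_record_pending[cpu], false)) {
            /* fetch and process the record for this cpu here */
            handled++;
        }
    }
    return handled;
}
```

Noting a record twice for the same cpu before it is processed still
yields a single pass over that cpu, because the flag only records that
something is waiting.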


Although we could theoretically handle concurrent MCA with these
patches, MCA is still single threaded by ia64_mca_serialize.  It is not
clear what our model should be for handling concurrent MCA on multiple
cpus; some discussion is required first.

Now that MCA/INIT is recoverable, we will have to address the SCSI
timeouts that occur if interrupts are disabled for long periods.  MCA
can disable interrupts for up to 20 seconds while it does the
rendezvous.  On resume, the timer code tries to bring jiffies in sync
with itc, so time runs too fast and we get spurious timeouts.  There is no
point in recovering from MCA if the disk dies as a side effect of the
lost interrupts.  Christoph Lameter is already working on this.

Still to do: convert mca_drv.c to use the pt_regs, switch_stack and minstate areas
instead of reading the MCA record.

Received on Sun Sep 11 17:16:24 2005
