Re: [patch] fix per-CPU MCA mess and make UP kernels work again

From: Jack Steiner <>
Date: 2005-02-05 03:24:22
On Fri, Feb 04, 2005 at 02:00:15PM +1100, Keith Owens wrote:
> On Thu, 3 Feb 2005 20:09:57 -0600, 
> Jack Steiner <> wrote:
> >On Thu, Feb 03, 2005 at 05:48:26PM -0600, Russ Anderson wrote:
> >> According to the SAL Spec, MCAs are supposed to be handled
> >> one at a time.  
> >
> >It has been a long time since I looked, but I thought the
> >spec allowed either implementation, i.e. serialize OR all-at-once.
> >
> >Maybe I'm remembering the error handling guide but I know
> >I have seen this somewhere.....
> It is ambiguous.  Extracts from SAL spec.
> 4.1.1 says only one processor gets OS_MCA.
>   When multiple processors experience machine checks simultaneously,
>   SAL selects a "monarch" machine check processor to accumulate all the
>   error records at the platform level and continue with the machine
>   check processing. "Monarch" status is relevant only for the current
>   MCA error event.
> 4.7.2 (5) also says only one processor.
>   5. SAL selects a monarch for handling the error. All slaves
>      processors in SAL_MC_RENDEZ check in their status with the SAL on
>      the monarch.
> But the last sentence of 4.7.2 (8) refers to multiple processors in OS
> MCA.
>   8. SAL finishes the MCA handling on all the processors that are in
>      MCA and waits for all the processors in MCA to synchronize before
>      branching to OS MCA for further processing.  Note that the
>      hand-off to OS MCA from SAL MCA occurs simultaneously on all
>      processors executing in SAL MCA handler.
> 4.7.2 (9) lets the OS choose the monarch, which implies that more than
> one cpu can be in OS MCA handler.
>   9. OS_MCA may choose a monarch processor to continue with error
>      handling. After OS_MCA completes the error handling, the monarch
>      processor wakes up all the slaves through a wake-up message as
>      shown by (9) in Figure 4-4
> The end of 4.7.3 also implies that OS MCA handler can be running on
> multiple cpus. Note 'on all the processors'.
>   When multiple processors experience machine checks simultaneously,
>   SAL selects a monarch machine check processor to accumulate all the
>   error records at the platform level. Once this is done, the OS_MCA
>   procedure will take control of further error handling on all the
>   processors that experienced the machine checks. The OS_MCA layer may
>   need to implement a similar monarch processor selection for the error
>   recovery phase. The operating system will be aware of which
>   processors invoked the SAL_MC_RENDEZ procedure in response to the
>   MC_rendezvous interrupt or the INIT signal and shall wake up those
>   processors.

To further muddy the waters, it looks like the latest Error Handling Guide
has addressed the issue:

>> Intel® Itanium® Processor Family Error Handling Guide April 2004
>> Document Number: 249278-003
>> 2.7.1
>> ...
>> The MCA error information is provided to the OS_MCA layer. The MCA
>> error record is logged to the NVM.  To simplify SAL implementation, it
>> is strongly recommended that SAL process all MCAs by handing off to the
>> OS as soon as possible to prevent some OSes from experiencing time-outs
>> and potentially crashing the system. >>>> The SAL may maintain a variable in
>> the SAL data area that indicates whether SAL, on one of the processors,
>> is already handling an MCA. If so, MCA processing on other processors will
>> wait within the SAL MCA handler until the current MCA is processed. This
>> situation may arise when local MCAs are experienced on multiple
>> processors. <<<<<<<

However, it says "may maintain a variable...".  Should I interpret this as 
allowing but not requiring serialization?
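
For illustration, the optional serialization that the guide describes could
look roughly like the sketch below. It is only a model in portable C11, not
SAL firmware code: the names mca_in_progress, mca_try_enter,
mca_enter_serialized and mca_exit are hypothetical, and a real SAL would keep
the flag in its own data area and use platform primitives instead of
<stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the "variable in the SAL data area" that
 * records whether some processor is already handling an MCA. */
static atomic_flag mca_in_progress = ATOMIC_FLAG_INIT;

/* Try to become the processor that is currently handling an MCA.
 * Returns true if the caller took ownership of the flag, false if
 * another processor already holds it. */
static bool mca_try_enter(void)
{
    return !atomic_flag_test_and_set_explicit(&mca_in_progress,
                                              memory_order_acquire);
}

/* Serialized variant: wait here until the current MCA has been
 * processed, then take ownership. */
static void mca_enter_serialized(void)
{
    while (!mca_try_enter())
        ;   /* busy-wait; assumed acceptable in this simplified model */
}

/* Mark the current MCA as processed so a waiting processor can proceed. */
static void mca_exit(void)
{
    atomic_flag_clear_explicit(&mca_in_progress, memory_order_release);
}
```

In the serialized reading, every processor would call mca_enter_serialized()
before the hand-off to OS_MCA and mca_exit() afterwards, so only one MCA is
in flight at a time. In the all-at-once reading the flag is simply not used,
which is why the "may" in the guide matters.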


Jack Steiner                            651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.

Received on Fri Feb 4 11:27:10 2005