Re: [PATCH] MCA recovery from memory read error caused by application

From: Hidetoshi Seto <seto.hidetoshi_at_jp.fujitsu.com>
Date: 2004-02-20 22:00:45

Tony-san,

Thank you for your reply.

> 1) Looking at gr13 in the minstate area when you determined that the
> MCA occurred in usermode will get you whatever value that the user
> happened to have in r13 ... not the task structure.

I had not noticed that. I'll add a check to validate the value.
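
A minimal sketch of what such a value check might look like, written as
plain C so it can be tried outside the kernel. The function name, the
region-7 test, and the 16 KB alignment are all assumptions made for
illustration; none of them are taken from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: kernel identity-mapped addresses live in region 7,
 * i.e. the top three bits of the 64-bit address are all set. */
#define REGION_SHIFT   61
#define KERNEL_REGION  7ULL

/* Assumption: task structures are aligned to at least 16 KB.
 * This is an illustrative value, not one read from kernel headers. */
#define TASK_ALIGN     (1ULL << 14)

/* Hypothetical helper: returns true only if the r13 value saved in the
 * min-state area could plausibly be a kernel task pointer. */
static bool r13_looks_like_task(uint64_t r13)
{
    /* A user-mode r13 normally points into a user region. */
    if ((r13 >> REGION_SHIFT) != KERNEL_REGION)
        return false;

    /* Reject pointers that are not aligned as a task structure would be. */
    if (r13 & (TASK_ALIGN - 1))
        return false;

    return true;
}
```

Since user code can load any value into r13, including one that passes
this test, it is only a sanity filter. The privilege level saved with the
interrupted context is probably the more reliable way to tell user mode
from kernel mode.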

> 2) Using "force_sig" in MCA context is probably going to violate some
> locking rules.  I'm making this blanket statement without looking at
> any code ... so I might be wrong ... but doing just about anything
> in an MCA handler runs into locking problems :-( .

Hmm... Is there a better way?

> 3) You need some code to clean the error from the page.
> Otherwise the page will be freed when the process terminates,
> and the next user will possibly hit the same error.
> 
> Now some general philosophical issues:
> 
> This implementation is making the assumption that errors are
> soft ... i.e. that there is no permanent problem with the
> memory when an error is reported.  More often than not this
> assumption is true ... transient errors are more frequent
> than hard failures.  But many RAS people prefer to err on
> the side of paranoia, and might prefer to take the page with
> the error out of service.

Agreed. The first thing to do is isolate the page; recycling it comes next.


Thanks,

H.Seto

Received on Fri Feb 20 06:01:02 2004