Re: [RFC&PATCH 1/2] PCI Error Recovery (readX_check)

From: Linus Torvalds <torvalds_at_osdl.org>
Date: 2004-08-26 03:25:22
On Wed, 25 Aug 2004, Grant Grundler wrote:
> 
> Do we only need to determine there was an error in the IO hierarchy
> or do we also need to know which device/driver caused the error?
> 
> If the latter I agree with linus. If the former, then the error recovery
> can support asyncronous errors (like the bad DMA address case) and tell
> all affected (thanks willy) drivers.

Yes, that we could do without locking. The simplest way to do that is to 
have a global sequence counter, and then the whole thing really boils down 
to

	typedef unsigned long pci_error_cookie_t;

	extern pci_error_cookie_t error_sequence_number;

	static inline pci_error_cookie_t clear_pci_errors(struct pci_dev *dev)
	{
		pci_error_cookie_t now;

		now = error_sequence_number;
		io_start_error_memory_barrier(dev);
		return now;
	}

	static inline int read_pci_errors(pci_error_cookie_t then, struct pci_dev *dev)
	{
		io_end_error_memory_barrier(dev);
		return (then != error_sequence_number) ? -EIO : 0;
	}

and have the error handler just increment the "error_sequence_number" 
whenever it happens.

However, the above relies on the PCI error being a NMI in the first place
(which it may or may not be), and also on the fact that we need to have
some way to make sure we get it if it was pending to avoid races (ie the
"io_end_error_memory_barrier()" may have to be pretty expensive and
bus-serializing - likely a config space read from the device).

And the above also relies on it being ok to see other peoples errors by
mistake.

(Depending on how much information such a PCI error NMI gives the kernel,
the "error_sequence_number" could be made per-domain or per-bus, of
course, but that's an implementation detail that depends on what the
hardware supports).

But I would not be surprised if "clear_pci_errors()" actually has to 
_clear_ some bit in the bridge device and "read_pci_errors()" has to check 
the bit afterwards. And if that is the case, I really do believe that you 
want to lock the whole bridge, because you can't have people clearing the 
bit when somebody else might be actively using it...

So I think we have to design for the thing potentially being a irq-safe 
spinlock - possibly even a global one. That's the worst case, and maybe a 
lot of hardware platforms can do less intrusive things, but if we're 
looking at generic infrastructure that different drivers are supposed to 
be able to use (and I assume that's what we want), then we have to make 
the interfaces generic.

		Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-ia64" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Received on Wed Aug 25 13:32:34 2004

This archive was generated by hypermail 2.1.8 : 2005-08-02 09:20:30 EST