Re: [PATCH] ptrace RSE bug

From: Roland McGrath
Date: 2007-11-14 22:09:14
> Roland McGrath wrote:
> > What's arch_ptrace_resume about?  
> > I thought we were agreed on the plan using TIF_RESTORE_RSE.
> No, after further discussion we came to the conclusion that introducing
> the bit actually saves us only a couple of user-to-kernel/kernel-to-user
> copies in do_exit(), but it complicates the kernel exit path, so it's
> not really worth it. Shaohua didn't call it arch_ptrace_resume(), but he
> added an argument to arch_ptrace_stop(). I was already testing a
> different variant of that patch and I'm sorry if it caused confusion. :(

I don't think I saw that part of the discussion, but I may have overlooked
it.  I'd thought I had already made clear to Shaohua (or his predecessors)
why to argue for TIF_RESTORE_RSE.  Not being an actual ia64 user myself,
the plan was to outsource the arguing with ia64 people to the Intel folks.
So much for that.

It's not just do_exit, where I presume you mean for the ptrace EXIT stop,
after which user mode will never run again.  It's also any time ptrace_stop
runs more than once before going back to user mode.  This includes a signal
stop that is followed by more signal stops, a syscall-exit stop followed by
signal stops, a ptrace_notify (clone et al) followed by another (vfork-done
follows clone) or by signal or syscall-exit stops, etc.

But optimizing those cases is not really what motivates me.  What you've
implemented is pretty much exactly what David and I settled on when we
first discussed this a few years ago.  The recent action on the subject
was spurred by a slightly more recent set of interests on my part.

> Is there any other advantage in introducing TIF_RESTORE_RSE than saving
> some unneeded data copying?

No, unless "other" includes "saving a lot of very unneeded data copying".
Given ptrace, I can see the argument for simplicity over optimization.
The TIF_RESTORE_RSE plan looks forward to future debugging facilities,
where this issue could be a large performance impediment that other
machines won't have.

I became aware of this issue a long time ago because of gdb's inability
to use /proc/pid/mem reliably as it can do on all other machines.  But
I only started pressing to resolve it while working on utrace.
Regardless of the fate of utrace per se, I think something will arise
that has the same requirements on arch code that I've set down and that
motivated the TIF_RESTORE_RSE approach to ia64's register backing store.

Consider a flexible facility for tracing actions at the kinds of event
points that ptrace monitors today.  Where now ptrace_notify or
ptrace_stop is called, a hook into some moderate intelligence can run to
decide whether to stop and what to communicate to the debugger and so forth.

It might monitor all syscalls and usually decide to do nothing at all,
or send a very cheap asynchronous notification somewhere.  The hoped-for
promise of fancy new facilities is that they can do this unobtrusively
across many, many threads in the system.  So in this case, it should be
no more expensive than TIF_SYSCALL_AUDIT.

It might stop and wait with a clunky amount of overhead like ptrace.
In this case, the additional overhead of extra copies probably doesn't rate.

It might not stop at all, but instead do some self-examination before
going on.  This might include reading (or writing) memory via
access_process_vm or get_user, including accessing register values in
the register backing store memory.  In this case, correctness demands
that the writeback to user memory be done before that memory is
examined, and that changed user memory will be reloaded into the RSE
before the user registers are used next (in user mode or at syscall entry).

But it's a flexible facility.  So it's not easy to know ahead of time
which of these scenarios it will be.  (In utrace, there's a function
pointer provided by some kernel module that can do what it likes.)
For the filtered event case (first of the three), you want to skip the
whole overhead.  For the ptrace-like case (second), you want to flush
before you stop and reload afterward.  For the non-stop introspection
case (third), you need flush and reload but there won't be any stop.
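
The three cases can be sketched in user-space C.  This is only a toy
model under stated assumptions: every name in it (demo_task, demo_event,
demo_sample_hook and so on) is hypothetical and is not a kernel or utrace
interface, and the "flush" and "reload" are just counters standing in for
the real copies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What a (hypothetical) tracing hook decides to do at an event point. */
enum demo_action { DEMO_IGNORE, DEMO_STOP, DEMO_INSPECT };

/* Toy thread state; counters stand in for the real RSE copies. */
struct demo_task {
    bool flushed;   /* backing store in user memory is up to date */
    int flushes;    /* how many writebacks were actually performed */
    int reloads;    /* how many reloads from user memory were performed */
    int stops;      /* how many times the thread really stopped */
};

typedef enum demo_action (*demo_hook_fn)(int event);

/* Write registers back to user memory, at most once per trip. */
static void demo_flush(struct demo_task *t)
{
    assert(t != NULL);
    if (!t->flushed) {
        t->flushed = true;
        t->flushes++;
    }
}

/* Event point: where ptrace_notify or ptrace_stop is called today. */
static void demo_event(struct demo_task *t, demo_hook_fn hook, int event)
{
    switch (hook(event)) {
    case DEMO_IGNORE:       /* filtered event: skip the whole overhead */
        break;
    case DEMO_STOP:         /* ptrace-like: flush, then stop */
        demo_flush(t);
        t->stops++;
        break;
    case DEMO_INSPECT:      /* non-stop introspection: flush, no stop */
        demo_flush(t);
        break;
    }
}

/* Return to user mode: reload only if a flush happened on this trip. */
static void demo_return_to_user(struct demo_task *t)
{
    if (t->flushed) {
        t->flushed = false;
        t->reloads++;
    }
}

/* Example hook: event 0 is filtered, 1 stops, anything else inspects. */
static enum demo_action demo_sample_hook(int event)
{
    if (event == 0)
        return DEMO_IGNORE;
    return event == 1 ? DEMO_STOP : DEMO_INSPECT;
}
```

In this model a filtered event costs neither a flush nor a reload, and a
stop followed by an inspection before returning to user mode costs one of
each.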

So the natural interface for these to all fall out optimal is that there
is no automagic copying at the low level, but an explicit arch
"writeback" function to call.  This gets called before something reads
user memory and expects it to be harmonized with the thread's state.  It
should be cheap to call when it's already been done, so multiple
uncoordinated things can request it when they need to ensure it, but
don't pay lots of extra overhead when several of those come before
actually going to user mode (or syscall entry).  To keep the bookkeeping
and interactions simple even among multiple uncoordinated things, that
one call should make it so that reloading from user memory is automatic
on the next return to user mode (or syscall entry).  ptrace calls this
at every stop (and attach, cf my earlier mail).  Other future facilities
would call it only selectively when they've decided they need register
values from memory at this particular stop.
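
A minimal user-space sketch of that interface follows.  It assumes a toy
thread structure with a boolean in the role of TIF_RESTORE_RSE; the names
toy_thread, toy_writeback and toy_resume are made up for illustration and
do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_NREGS 4

/* Hypothetical stand-in for a thread; not a kernel structure. */
struct toy_thread {
    long rse_regs[TOY_NREGS];      /* values still held by the "RSE" */
    long backing_store[TOY_NREGS]; /* their image in "user memory" */
    bool restore_pending;          /* plays the role of TIF_RESTORE_RSE */
    int flush_count;               /* real writebacks performed */
    int reload_count;              /* real reloads performed */
};

/*
 * Explicit writeback.  Cheap when it has already been done, so several
 * uncoordinated callers can each ask for it.  It also arms the reload
 * that must happen before user registers are used again.
 */
static void toy_writeback(struct toy_thread *t)
{
    assert(t != NULL);
    if (t->restore_pending)
        return;                    /* already harmonized with memory */
    memcpy(t->backing_store, t->rse_regs, sizeof(t->backing_store));
    t->flush_count++;
    t->restore_pending = true;
}

/*
 * Runs on the way back to user mode (or at syscall entry).  Reloads
 * from user memory only if a writeback was requested since last time.
 */
static void toy_resume(struct toy_thread *t)
{
    assert(t != NULL);
    if (!t->restore_pending)
        return;
    memcpy(t->rse_regs, t->backing_store, sizeof(t->rse_regs));
    t->reload_count++;
    t->restore_pending = false;
}
```

Calling toy_writeback twice in a row performs one copy, and a change made
to backing_store in between shows up in rse_regs after toy_resume.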

The bookkeeping to prevent repeated flushes and to ensure reloads before
resuming could be done by higher layers that call ia64_ptrace_stop and
ia64_ptrace_resume.  But, this whole issue really only exists on ia64.
So it makes sense to keep the semantics of the generic interface that has
to consider it as simple as possible and push this work into the ia64 code,
and TIF_* is a perfect fit for this.  The main "complexity" introduced
is the overloading of TIF_NOTIFY_RESUME because there isn't another free
bit in TIF_ALLWORK_MASK.  The need to reload from memory after
syscall_trace_enter and before using the register values as syscall args
is subtle, too.  But still, it's not so much.  And frankly, ia64
deserves it for deciding to have the godforsaken RSE semantics.
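
The syscall-entry subtlety can be illustrated with a toy model.  This is a
sketch under assumptions, not the ia64 entry path: sk_thread,
sk_syscall_entry and the tracer callback are hypothetical names, and a
single long stands in for a syscall argument that lives both in a register
and in backing store memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical thread with one syscall argument held in two places. */
struct sk_thread {
    long reg_arg0;     /* argument as held in a register */
    long mem_arg0;     /* same slot in backing store memory */
    bool restore_rse;  /* models a TIF_RESTORE_RSE-style flag */
};

typedef void (*sk_tracer_fn)(struct sk_thread *t);

/* Flush the register to memory and arm the reload. */
static void sk_writeback(struct sk_thread *t)
{
    assert(t != NULL);
    if (!t->restore_rse) {
        t->mem_arg0 = t->reg_arg0;
        t->restore_rse = true;
    }
}

/* Reload from memory if a writeback armed it. */
static void sk_reload_if_pending(struct sk_thread *t)
{
    if (t->restore_rse) {
        t->reg_arg0 = t->mem_arg0;
        t->restore_rse = false;
    }
}

/*
 * Syscall entry: the tracer (if any) sees memory that matches the
 * registers and may change it; the reload must happen before the
 * register value is used as a syscall argument.
 */
static long sk_syscall_entry(struct sk_thread *t, sk_tracer_fn tracer)
{
    if (tracer != NULL) {
        sk_writeback(t);
        tracer(t);
    }
    sk_reload_if_pending(t);
    return t->reg_arg0;        /* value the syscall would actually see */
}

/* Example tracer that rewrites the argument through memory. */
static void sk_tracer_set_arg(struct sk_thread *t)
{
    t->mem_arg0 = 42;
}
```

Without the sk_reload_if_pending call, the syscall in this model would
still see the stale register value rather than what the tracer wrote.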


Received on Wed Nov 14 22:09:32 2007
