Re: perfmon2 vector argument question

From: Andrew Morton <>
Date: 2006-06-29 13:17:08
On Mon, 19 Jun 2006 13:40:12 -0700
Stephane Eranian <> wrote:

> Hello,
> The current perfmon2 API allows applications to pass vectors of arguments to
> certain calls, in particular to the 3 functions to read/write PMU registers. 
> This approach was chosen because it is very flexible and allows applications
> to modify either multiple or a single register in one call. It is extensible
> because there is no implicit knowledge of the actual number of registers supported
> by the underlying hardware.
> Before entering the actual system call, the argument vector must be copied
> into a kernel buffer. This is required by convention for security and also
> fault reasons. The famous copy_from_user() and copy_to_user() are invoked.
> This must be done before interrupts are masked.
> Vectors can have different sizes depending on the measurement, the PMU model.
> Yet, the vector must be copied into a kernel-level buffer. Today, we allocate
> the kernel-memory on demand based on the size of the vector. We use
> kmalloc/kfree. Of course, to avoid any abuse, we limit the size of the
> allocated region via a perfmon2 tunable in sysfs. By default, it is set
> to a page.
> This implementation has worked fairly well, yet it costs some performance
> because kmalloc/kfree are expensive (especially kfree). Also it may seem
> overkill to malloc a page for small vectors.
> I have run some experiments lately and they verified that kmalloc/kfree and
> copy to/from user account for a very large portion of the cost of calls with
> multiple registers (I tried with 4). For the copies it is hard to avoid
> them. One thing we could do is to try and reduce the size of the structs.
> Today, both struct pfarg_pmd and struct pfarg_pmc have reserved fields
> for future extensions so that we can extend without breaking the ABI.
> It may be possible to reduce those a little bit.
> There are several ways to amortize or eliminate the kmalloc/kfree. First of
> all, it is important to understand that multiple threads may call into a 
> particular context at any time. All they need is access to the file descriptor.
> An alternative that I have explored is to start from the hypothesis that
> most vectors are small. If they are small enough, we could avoid the
> kmalloc/kfree by using a buffer allocated on the stack. One could say
> if the vector is less than 8 elements, then use the stack buffer. If not, then
> go down the expensive path of kmalloc/kfree. I tried this experiment and got
> over 20% improvement for pfm_read_pmds(). I chose 8 as the threshold. The
> downside of this approach is that kernel stack space is limited and we should
> avoid allocating large buffers on it. The pfarg_pmd struct is about 176 bytes
> whereas pfarg_pmc_t is about 48 bytes. With 8 elements we reach 1408 bytes and
> this is true for all architectures including i386 where default kernel stack
> is 2 pages (8kB). Of course, the stack buffer could be adjusted per object
> type and per-architecture. The downside is that if you need to use kmalloc
> the stack space is still consumed.
> It is important to note that we cannot use a kernel buffer of one element and simply
> loop over the vector. Because the copy_from/copy_to must be done without locks nor
> interrupt masked. So one  would have to copy, lock, do the perfmon call, unlock, copy
> and loop for the next element.
> Another approach that was suggested to me is to allocate on demand but not kfree
> systematically when the call terminates. In other words, we amortize the cost
> of the allocation by keeping the buffer around for the next caller. To make
> this work, we would have to decompose the spin_lock_irq*() into spin_*lock()
> and local_irq_*able() to avoid a race condition. For the first caller the
> buffer would be allocated to fit the size (up to a certain limit like today).
> When the call terminates, the buffer is kept via a pointer in the perfmon
> context. The next caller, would check the pointer and size, if the buffer
> is big enough, copy_user could proceed directly, otherwise a new buffer would
> be allocated. That would also work assuming it is OKAY to copy_user with some locks
> held. I can see one issue with this approach as some malicious user could create
> lots of contexts and make one call for each to max out the argument vector limit for
> each. If you have 1024 descriptors and the limit is 1 page/context, it could allocate
> 1024 kernel pages (non-pageable) for nothing. Today, we do not have a global tuneable
> for the argument vector size limit. Adding one would be costly because multiple threads
> could potentially contend for it and therefore we would need yet another lock.
> I do not see another approach at this point.
> Does someone have something else to propose?
> If not, what is your opinion of the two approaches above?

The first approach should be fine - we do that in lots of places, such as
in core_sys_select().

Applications mut be calling this thing at a heck of a rate for kfree()
overhead to matter.  I trust CONFIG_DEBUG_SLAB wasn't turned on...
To unsubscribe from this list: send the line "unsubscribe linux-ia64" in
the body of a message to
More majordomo info at
Received on Thu Jun 29 13:27:29 2006

This archive was generated by hypermail 2.1.8 : 2006-06-29 13:27:39 EST