Re: [PATCH] add a clear_pages function to clear pages of higher order

From: Denis Vlasenko <vda_at_port.imtp.ilyichevsk.odessa.ua>
Date: 2005-03-18 20:54:37

On Thursday 17 March 2005 03:33, Christoph Lameter wrote:
> On Fri, 11 Mar 2005, Denis Vlasenko wrote:
> 
> > Andi Kleen (iirc) says that non-temporal stores seem to be a
> > big win in microbenchmarks (and I second that), but they are
> > a net loss when we are going to use the zeroed page just after
> > zeroing. He recommends avoiding non-temporal stores.
> >
> > With this new page prezeroing infrastructure, that argument
> > most likely is not right anymore. Especially clearing of
> > high-order pages definitely will benefit from NT stores
> > because they do not kill L1 data cache in the process.
> >
> > I don't have a K8 and therefore cannot be 100% sure, but
> > I really doubt that K8 optimizes "rep stosq" into _NT_ stores.
> 
> Hmm. That would be interesting to know and may be necessary to justify
> the continued existence of this patch. I tried to get some numbers on
> the performance wins for zeroing larger pages with the patch as is (no
> NT stores) and came up with:
> 
> Processor				Performance Increase
> ----------------------------------------------------------------
> Itanium 2 1.3Ghz M1/R5			1.5%
> AMD Athlon 64 3200+ i386 mode		3%
> AMD Athlon 64 3200+ x86_64 mode		3.3%
> 
> (this is if the zeroing engine is the cpu of course. Prezeroing
> may be done through some DMA gizmo independent of the cpu)
> 
> Itanium has more extensive optimization capabilities and
> seems to be able to better cope with the loop logic for regular
> clear_page. Thus the improvement is even less on Itanium.
> 
> Numbers obtained with the following patch, which exposes zeroing
> performance data in /proc/meminfo (just divide Cycles by
> Pages for clear_page and clear_pages):

Here is a patch that lets different page zeroing optimizations
be selected and tested at runtime via sysctl.
It was run-tested around 2.6.8 and has been rediffed against 2.6.11.
Feel free to adapt it to your patch and test.

Also attached is a tarball of microbenchmarking routines. It contains two
result files. Duron:

               normal_clear_page - took  8644 max, 8400 min cycles per page
             repstosl_clear_page - took  8626 max, 8418 min cycles per page
                 movq_clear_page - took  8647 max, 8300 min cycles per page
               movntq_clear_page - took  2777 max, 2720 min cycles per page

And amd64:

               normal_clear_page - took  9427 max, 5781 min cycles per page
             repstosl_clear_page - took  9305 max, 5680 min cycles per page
                 movq_clear_page - took  6167 max, 5576 min cycles per page
               movntq_clear_page - took  5456 max, 2354 min cycles per page

NT stores are not an improvement of about 5%; it is more like 200%-300%,
provided you are OK with the fact that the zeroed page ends up evicted
from cache. Luckily, this is exactly what you want with prezeroing.
--
vda



Received on Fri Mar 18 04:59:01 2005
