Re: RFC: New diff-delta.c implementation

From: Nicolas Pitre <nico@cam.org>
Date: 2006-04-24 15:27:07
On Sun, 23 Apr 2006, Geert Bosch wrote:

> Note that I sent this code as a RFC, with explicit disclaimers about
> style.

I took the liberty of reworking the coding style, as well as simplifying 
some constructs a bit.

I also fixed all the format encoding and size computations bugs the 
original code has so it works perfectly now.

> So, I did not want to sign off on this code, since I pretty
> much knew there would be some problems with the undocumented
> ("proprietary", according the libxdiff site) file format. In contrast
> the GDIFF fileformat was documented very well, and I have a version
> of this code that works flawlessly with that format.

Note that the GIT encoding is itself different (a bit denser) from the 
xdiff encoding.

> > Regarding your FIXME comment about endianess: I think you are looking
> > for htonl().  Use it to convert the values from host byte order to
> > network byte order (= big endian) and you can get rid of those ugly
> > branches.
> Ah, I'll use that. It's of course a slight change that all processing
> now is big-endian centric, but that might actually even result in
> better code in this case. I'm just assuming any decent system has
> some highly optimized macro for this and will never do a function call.
> This is used in the most performance critical loops, and doing function
> calls here will lead to horrendous performance.

Please just completely drop that word fetch "optimization".  Not only 
does it produce worse assembly code in the end due to the extra loop 
counter and extra shifting in turn increasing register pressure, but it 
also assumes that the target data buffer is always word aligned which is 
not guaranteed (casting a char pointer to an unsigned and dereferencing 
it is not portable since on some architecture such misaligned accesses 
are not allowed).  And it even has worse performance.  For example, with 
word fetch and shift code, packing the Linux kernel archive on my P4 at 
3GHz:

$ git-repack -a -f
...
real    7m5.727s
user    6m33.797s
sys     0m31.394s

And with the word fetch optimization completely ripped out:

$ git-repack -a -f
...
real    6m45.443s
user    6m13.911s
sys     0m31.434s

So you must be extremely careful when trying to implement such kind of 
optimizations.

Now disabling all assert() statements gives:

$ git-repack -a -f
...
real    6m42.478s
user    6m12.031s
sys     0m31.090s

But here comes the sad part.  Even after simplifying the code as much as 
I could, performance is still significantly worse than the current 
diff-delta.c code.  Repacking again the same Linux kernel repository 
with the current code:

$ git-repack -a -f
...
real    4m20.742s
user    3m52.255s
sys     0m28.758s

The final pack is smaller with your code but not significantly: 
117867049 bytes vs 118824550 bytes with the current code, i.e. less than 
1% difference.

You can find attached my current version of your code.

In the mean time I'll try replacing the adler32 hashing with your Rabin 
polynomial hashing into the current code just to see if it helps (and I 
think it will help quite a bit).  The best solution might be a mix of 
both approaches.

Of course there is the issue of reversing the object matching window 
logic in pack-object.c to further improve things, but I consider that an 
orthogonal improvement at this point which both approaches will benefit 
from anyway.


Nicolas

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Received on Mon Apr 24 15:27:46 2006

This archive was generated by hypermail 2.1.8 : 2006-04-24 15:28:03 EST