Rsync performance - a suggestion

From: barry <barry@disus.com>
Date: 2005-04-14 00:45:37
It seems to me that rsync performance would be linearly dependent on
(a fraction of) the number of objects in the repo. Isn't it a goal that
performance should be uniform regardless of the repo size? The 200K
objects figure assumes the repo has just one "project" in it. If it
contains "mm" and "2.6.12" at the same time, then it might be
significantly bigger.

I have the following suggestion:
There are two important properties of the git architecture that rsync is
not taking advantage of:
1) object names are the same across all servers, and also uniquely
identify the contents. There is no need to look at contents as long as
the name is the same between client and server (and the object is
well formed).
2) if a higher-level object such as a commit or a tree is already
shared, that automatically implies that its child trees and blobs are
already present on both client and server.
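
To make property (1) concrete, here is a minimal sketch. It assumes, as a
simplification, that an object's name is just the SHA-1 of its raw bytes
(what git actually feeds to the hash is not covered here), and the
function names are hypothetical. The point is only that two sides can
work out what is missing by comparing names, without reading contents.

```python
import hashlib


def object_name(content):
    """Name an object by the SHA-1 of its bytes (simplified stand-in for git's naming)."""
    return hashlib.sha1(content).hexdigest()


def names_to_fetch(client_names, server_names):
    """Objects the client lacks, decided purely by name; contents are never compared."""
    return set(server_names) - set(client_names)


if __name__ == "__main__":
    shared = object_name(b"same file on both sides\n")
    only_on_server = object_name(b"new file on the server\n")
    print(names_to_fetch({shared}, {shared, only_on_server}))
```

Property (2) is what lets a transfer skip whole subtrees once a shared
commit or tree is found.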

A new rsync-like protocol can then be used to very efficiently push or
pull a repository:

1) send a "rev-tree"-like query from client to server to establish a
list of commits that the client wishes to exchange with the server. This
query can take advantage of knowing which heads have been exchanged in
advance. If you are closely in sync already, then only a small number of
commits will be sent.
2) use an rsync-like scheme to get the "deltas" of which commits need to
be transferred from server<->client, and use rsync itself to copy the
commit objects into RSYNC/.git/objects/*/*
3) create a flattened list of tree objects to be exchanged based on the
commits in (2), and use rsync-like delta detection to send only the
differing tree objects into RSYNC/.git/objects/*/*
4) based on the trees sent in (3), do an rsync-like delta of blobs
5) when all the protocol negotiation is correct and all objects are
transferred, make the RSYNC copy live with "cp -r RSYNC/.git/objects
LIVE/.git/objects"
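
Steps (2) to (4) can be sketched as a single walk over the object graph.
This is only an illustration under stated assumptions: the graph is
modelled as a plain dict from an object name to the names of its
children (a commit to its tree, a tree to its subtrees and blobs), and
objects_to_send, children and receiver_has are hypothetical names, not
part of git or rsync. The walk stops descending as soon as it meets an
object the receiver already has, which is where property (2) pays off.

```python
def objects_to_send(wanted_commits, children, receiver_has):
    """Return the names of objects the receiver is missing.

    wanted_commits: commit names agreed on in step (1)
    children:       dict mapping an object name to its child names
    receiver_has:   set of names already present on the receiving side
    """
    to_send = []
    seen = set()
    stack = list(wanted_commits)
    while stack:
        name = stack.pop()
        # An object the receiver has implies its whole subtree is there too,
        # so neither it nor its children need to be visited.
        if name in seen or name in receiver_has:
            continue
        seen.add(name)
        to_send.append(name)
        stack.extend(children.get(name, []))
    return to_send


if __name__ == "__main__":
    children = {
        "commit2": ["tree2"],
        "tree2": ["blob_a", "subtree", "blob_b2"],
        "subtree": ["blob_c"],
    }
    receiver_has = {"commit1", "tree1", "blob_a", "subtree", "blob_c", "blob_b1"}
    print(objects_to_send(["commit2"], children, receiver_has))
```

In this example only the new commit, its top-level tree and the one
changed blob are selected; the unchanged subtree is never opened, so the
work scales with the size of the change rather than the size of the
repo.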

Is that reasonable?


Barry Silverman

-----Original Message-----
From: git-owner@vger.kernel.org [mailto:git-owner@vger.kernel.org] On
Behalf Of Ingo Molnar
Sent: Wednesday, April 13, 2005 2:29 AM
To: H. Peter Anvin
Cc: Petr Baudis; Linus Torvalds; Andrew Morton; git@vger.kernel.org
Subject: Re: incoming



* H. Peter Anvin <hpa@zytor.com> wrote:

> Petr Baudis wrote:
> >
> >I wonder how much it costs in network traffic to just check that a
> >2GB rsync repository is up-to-date?
> >
> 
> It mostly depends on the number of files (objects.)  You obviously 
> have to make a list of them and create a list of missing objects.
> Figure something like 512 bytes of traffic per file to correlate the 
> lists, and the rest is the actual data.

with rsync and DSL, the 'check that there is nothing to sync' time of 
the kernel-test.git repository via rsync is ~25 seconds. That's ~20K 
objects and 200MB of a repository - the expected 200K objects in the 
final (2 GB) repository will then take ~250 seconds (4 minutes). This is
not particularly fast, but not slow either.
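
The estimate above is a straight linear extrapolation, which can be
written out as a small sketch. The figures are the ones quoted in this
thread (~25 seconds for ~20K objects, ~512 bytes of list traffic per
file); the function names are hypothetical and the linear model is an
assumption, not a measurement.

```python
MEASURED_OBJECTS = 20_000   # objects in kernel-test.git, as quoted
MEASURED_SECONDS = 25       # 'nothing to sync' check time over DSL, as quoted
BYTES_PER_OBJECT = 512      # list-correlation traffic per file, as quoted


def estimated_check_seconds(n_objects):
    """Scale the measured check time linearly with the object count."""
    return MEASURED_SECONDS * n_objects / MEASURED_OBJECTS


def estimated_list_traffic_bytes(n_objects):
    """Traffic needed just to correlate the file lists."""
    return BYTES_PER_OBJECT * n_objects


if __name__ == "__main__":
    print(estimated_check_seconds(200_000))
    print(estimated_list_traffic_bytes(200_000))
```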

the act of checking that there is nothing to do could be optimized
almost arbitrarily: e.g. calculating an sha1 hash of the rev-tree output
of the desired branch, and comparing it with the local version should
take less than a second. Or if one is only interested in the 'HEAD' of a
single-project repository, then the HEAD file's content can be used to
decide whether to rsync the objects hierarchy.
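
A minimal sketch of that cheap up-to-date check, assuming the commit
listing for the branch is available locally as a list of text lines and
that the remote side publishes a digest of its own listing. How the
listing is produced and how the remote digest is fetched are left out;
listing_digest and needs_sync are hypothetical names.

```python
import hashlib


def listing_digest(lines):
    """SHA-1 over a commit listing, one line at a time."""
    h = hashlib.sha1()
    for line in lines:
        h.update(line.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


def needs_sync(local_lines, remote_digest):
    """True only when the local listing differs from what the remote reports."""
    return listing_digest(local_lines) != remote_digest


if __name__ == "__main__":
    local = ["commit-1", "commit-2"]
    remote = listing_digest(["commit-1", "commit-2", "commit-3"])
    print(needs_sync(local, remote))
```

Only when this returns True would the full rsync of the objects
hierarchy be started.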

	Ingo
-
To unsubscribe from this list: send the line "unsubscribe git" in the
body of a message to majordomo@vger.kernel.org More majordomo info at
http://vger.kernel.org/majordomo-info.html

Received on Wed Apr 13 07:31:40 2005

This archive was generated by hypermail 2.1.8 : 2005-04-15 12:56:42 EST