It seems to me that rsync performance would be linearly dependent on (a fraction of) the number of objects in the repo. Isn't it a goal that performance should be uniform regardless of the repo size?

200K objects assumes the repo has just one "project" in it. If it contains "mm" and "2.6.12" at the same time, then it might be significantly bigger.

I have the following suggestion. There are two important properties of the git architecture that rsync is not taking advantage of:

1) Object names are the same across all servers, and also uniquely identify the contents. There is no need to look at the contents as long as the name is the same on client and server (and the object is well formed).

2) If a higher-level object such as a commit or a tree is already shared, that automatically implies that its child trees and blobs are already present on both client and server.

A new rsync-like protocol can then be used to push or pull a repository very efficiently:

1) Send a "rev-tree"-like query from client to server to establish the list of commits that the client wishes to exchange with the server. This query can take advantage of knowing which heads have been exchanged in advance. If you are closely in sync already, then only a small number of commits will be sent.

2) Use an rsync-like scheme to get the "deltas" of which commits need to be transferred between server and client, and use rsync itself to copy the commit objects into RSYNC/.git/objects/*/*

3) Create a flattened list of tree objects to be exchanged based on the commits in (2), and use rsync-like delta detection to send only the differing tree objects into RSYNC/.git/objects/*/*

4) Based on the trees sent in (3), do an rsync-like delta of blobs.

5) When all the protocol negotiation is correct and all objects are transferred, make the RSYNC copy live: "cp -r RSYNC/.git/objects LIVE/.git/objects"

Is that reasonable?
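To make steps 1) to 4) a bit more concrete, here is a rough Python sketch of how the sender could work out what to transfer. It is only an illustration under assumptions: the repository is modelled as a plain dict from object name to a small record, and the names Obj, commits_to_send and objects_to_send are made up for this example, not existing git tools. The set "have" stands for the object names the receiver is already known to hold. Step 5) (copying the staging area into the live repository) is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    """Hypothetical stand-in for a git object, identified elsewhere by its name."""
    kind: str               # "commit", "tree" or "blob"
    parents: tuple = ()     # parent commit names (commits only)
    children: tuple = ()    # names of trees/blobs this object refers to


def commits_to_send(store, heads, have):
    """Steps 1-2: walk back from the wanted heads, stopping at any
    commit the receiver already has."""
    todo, out, seen = list(heads), [], set()
    while todo:
        name = todo.pop()
        if name in seen or name in have:
            continue
        seen.add(name)
        out.append(name)
        todo.extend(store[name].parents)
    return out


def objects_to_send(store, heads, have):
    """Steps 3-4: new commits plus the trees and blobs below them.
    Objects are compared by name only (property 1), and the walk never
    descends below an object the receiver already has (property 2)."""
    commits = commits_to_send(store, heads, have)
    send, seen = list(commits), set(commits)
    todo = [child for name in commits for child in store[name].children]
    while todo:
        name = todo.pop()
        if name in seen or name in have:
            continue
        seen.add(name)
        send.append(name)
        todo.extend(store[name].children)
    return send
```

With this shape the amount of work depends on the number of new objects rather than on the total size of the repository, which is the property I am after.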
Barry Silverman

-----Original Message-----
From: git-owner@vger.kernel.org [mailto:git-owner@vger.kernel.org] On Behalf Of Ingo Molnar
Sent: Wednesday, April 13, 2005 2:29 AM
To: H. Peter Anvin
Cc: Petr Baudis; Linus Torvalds; Andrew Morton; git@vger.kernel.org
Subject: Re: incoming

* H. Peter Anvin <hpa@zytor.com> wrote:

> Petr Baudis wrote:
> >
> >I wonder how much it costs in network traffic to just check that a
> >2GB rsync repository is up-to-date?
> >
>
> It mostly depends on the number of files (objects.) You obviously
> have to make a list of them and create a list of missing objects.
> Figure something like 512 bytes of traffic per file to correlate the
> lists, and the rest is the actual data.

with rsync and DSL, the 'check that there is nothing to sync' time of the kernel-test.git repository via rsync is ~25 seconds. That's ~20K objects and 200MB of a repository - the expected 200K objects in the final (2 GB) repository will then take ~250 seconds (4 minutes). This is not particularly fast, but not slow either.

the act of checking that there is nothing to do could be optimized almost arbitrarily: e.g. calculating an sha1 hash of the rev-tree output of the desired branch, and comparing it with the local version should take less than a second. Or if one is only interested in the 'HEAD' of a single-project repository, then the HEAD file's content can be used to decide whether to rsync the objects hierarchy.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Received on Wed Apr 13 07:31:40 2005