Re: Compression and dictionaries

From: Jon Smirl <jonsmirl@gmail.com>
Date: 2006-08-17 00:43:30
On 8/16/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Wed, 16 Aug 2006, Shawn Pearce wrote:
>
> > Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > > Hi,
> > >
> > > On Wed, 16 Aug 2006, Jon Smirl wrote:
> > >
> > > > On 8/16/06, John Rigby <jcrigby@gmail.com> wrote:
> > > > > Sorry if this is off topic, but could the dictionary be used to make
> > > > > git-grep alot faster?
> > > >
> > > > It would be almost instant.
> > >
> > > But only if you are not using a regular expression, but a single word.
> >
> > Yes and no.  If the inverted index contains terms broken by some
> > known pattern (e.g. break on word-type boundaries) and the regex
> > in question has constant sections (it should, otherwise it might
> > as well just be '.') then you can reduce your search space to a
> > fraction of the overall data by looking at the inverted index to
> > select likely terms, select the related revisions containing those
> > possible terms, then run the regex only on those revisions.
> >
> > Sure you would be possibly pulling out a number of false positives
> > but if the constant sequence(s) in the regex reduce your search
> > space to below 1/2 of the overall data that's probably a lot less
> > I/O and CPU required to complete the query, even if you have to
> > read the entire dictionary and apply each term in the dictionary
> > to the regex to look for those possible matches.
>
> So it would speed up the search, but no, in case of regular expressions,
> particularly any interesting one, the result would not be instantaneous.

Instant is a relative term. Google is instant compared to running grep
over 10TB of data. How long would that take, a month?

Shawn is correct, the inverted indexes are used to eliminate as many
files as possible. So the response time is a more of a function of how
many hits you have instead of how big the data set is. Of course if
you give it a pattern that matches everything it will just as slow as
grep. Give it a pattern that is only in one file and detectable by the
index and it will be very fast. If you are going to give it a bunch of
patterns that aren't in the index, then we need to adjust how the
index is built.

-- 
Jon Smirl
jonsmirl@gmail.com
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Received on Thu Aug 17 00:44:57 2006

This archive was generated by hypermail 2.1.8 : 2006-08-17 00:45:47 EST