[Xapian-discuss] b+tree to cdb

Olly Betts olly at survex.com
Mon Sep 27 15:59:28 BST 2004


On Mon, Sep 27, 2004 at 02:46:27AM -0600, gervin23 wrote:
> this was querying just after building a fresh index. yeah, i did notice 
> significant speedups the more i used it, though i wasn't sure what was 
> going on :) this explains it, thanks.

I'll add a similar note to scalability.html.

> these particular tests were pure phrases i.e. "president johnson".

Oh, there's one other phrase searching technique we don't currently use
but could.  It's nice because it addresses the worst-case cost of phrase
searching.

The idea is to index pairs of adjacent words.  But because the number of
different pairs which occur is so vast (much larger than the number of
different words), you hash the pairs.  This will generate a few false
matches, but it's cheap to eliminate those which don't contain both
words from the phrase by making the underlying search:

president AND johnson AND HASH_OF("president johnson")

And then check the results of this to see which actually contain the
phrase.
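
Here's a rough sketch of the indexing side in Python (the "PAIR" term
prefix, the hash choice, and the function names are all illustrative,
not what anything actually uses):

import hashlib

def pair_term(w1, w2):
    # Hash the adjacent pair down to a short fixed-size term.  A
    # truncated digest keeps terms small at the cost of a few more
    # false matches, which the AND-check above already catches.
    digest = hashlib.md5((w1 + " " + w2).encode("utf-8")).hexdigest()
    return "PAIR" + digest[:8]

def adjacent_pair_terms(words):
    # One hashed term for each pair of adjacent words in the document.
    return [pair_term(a, b) for a, b in zip(words, words[1:])]

# Indexing "president johnson said" yields hashed terms for
# ("president", "johnson") and ("johnson", "said"); searching for the
# phrase "president johnson" then maps to the query above.
print(adjacent_pair_terms(["president", "johnson", "said"]))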

Richard and I came up with this scheme a couple of years ago, but
apparently it (or something very similar) is commonly used to
implement phrase searching.

> yeah, this sounds like a job for down the road :) i'm already very 
> impressed with xapian using my little dataset but was wondering what 
> kind of possibilities there might be with 10+ million docs when something 
> that big comes my way. certainly splitting the indexes up will help as 
> mentioned in the help docs.

Splitting helps indexing, but may actually slow retrieval a little.
I've not benchmarked it recently, and the effect may be negligible.
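
If you do split, searching the parts as one is easy; a minimal sketch
with the current Python bindings ("part1" and "part2" are hypothetical
index directories):

import xapian

# Combine several separately-built indexes into one virtual database;
# documents are searched and ranked as if they were in a single index.
db = xapian.Database("part1")
db.add_database(xapian.Database("part2"))

enquire = xapian.Enquire(db)
enquire.set_query(xapian.Query("president"))
for match in enquire.get_mset(0, 10):
    print(match.docid, match.weight)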

> >But get_mset only actually gets the document ids and weights - document
> >data, etc is lazily fetched, so if you don't ask for it, you don't incur
> >any extra overhead.
> >
> >Getting 10 matches is faster not because we're reading less document
> >data but because we're able to deploy our shortcutting optimisations
> >more often and more effectively.
> 
> after issuing get_mset(0,10), are all the document ids thrown out (or 
> unavailable) after the first 10, or are they retrievable via the API? i 
> don't need data, values, etc., just the entire set of matching ids.

The matcher doesn't work in the obvious way - considering every document
which matches the boolean query and then looking at its weight would be
slower.

There are a number of optimisations at work, but here's a flavour of how
things work:

Once we have enough documents to fill the mset we've been asked for (so
10 in your case) we track the lowest weight amongst these.  Any document
with a lower weight than this is now irrelevant to us.

For an OR query of two terms, we know the maximum weight that each term
can contribute (and there's also a per-document component which we have
a maximum for).  At some point the threshold weight for entering
the proto-mset will probably have increased to such a level that only
documents containing both of the terms can make it in.  At this point
we switch the OR for an AND and no longer even see document ids which
match only one of the terms.
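
A much-simplified sketch of that threshold tracking (the names are
invented, and the real matcher does a great deal more):

import heapq

def best_k(postings, k):
    # postings: an iterable of (docid, weight) pairs for matching docs.
    heap = []  # min-heap of the k best (weight, docid) seen so far
    for docid, weight in postings:
        if len(heap) < k:
            heapq.heappush(heap, (weight, docid))
        elif weight > heap[0][0]:
            # Beats the lowest weight in the proto-mset, so it gets in.
            heapq.heapreplace(heap, (weight, docid))
        # heap[0][0] is the entry threshold: once it exceeds the most a
        # single term can contribute, documents matching only one term
        # of an OR can never get in, so the matcher skips them entirely.
    return sorted(heap, reverse=True)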

So if you want the entire set of matching ids, you really do have to ask
for it.  This will defeat most of the optimisations, but there's really
no way around that.  The optimisations work precisely because we can
avoid looking at all of this set in most cases.
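
Concretely, that means something like this (Python bindings; a sketch
using a hypothetical "index" directory, not tested):

import xapian

db = xapian.Database("index")
enquire = xapian.Enquire(db)
enquire.set_query(xapian.Query("president"))

# Asking for up to every document in the database forces the matcher to
# consider them all, which is what defeats the shortcutting above.
mset = enquire.get_mset(0, db.get_doccount())
all_ids = [match.docid for match in mset]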

You could potentially catch the subset which was considered but didn't
make it into the final mset, but that's generally going to be very
incomplete.

The only overhead you could avoid comes from wanting all the document
ids but needing only the first 10 in ranked order.  But I doubt you'd
even notice the saving.

> >>hardware-wise, i'm running a Pentium III (Katmai) @ 551MHz with 512MB 
> >>RAM and a 7200rpm disk (hdparm reports 39.45 MB/sec).
>
> >How big are the database tables (i.e. what's the output of ls -l on the
> >database directory)?
> 
> 445MB after quartzcompact.

So you've probably got enough RAM to cache pretty much the whole
database.  You could try getting the positionlist table into the OS
cache by using:

cat position_DB > /dev/null

And then see what phrase search speed is like.

Cheers,
    Olly
