[Xapian-discuss] b+tree to cdb

gervin23 gervin23 at fastmail.fm
Mon Sep 27 09:46:27 BST 2004


>>i've been performing some tests with xapian on a set of 86801 docs
>>with an average length of 291.895. i've also removed stop words to
>>help with speed and size, but i'm noticing some serious performance
>>hits with phrase searches, sometimes taking 14s+ compared to the
>>~0.01s i'm getting with most boolean searches.
> 
> 
> Is this in real use, or when running test queries after building the
> database?  One effect to be aware of is that queries will be a lot
> slower when nothing is cached.  In real use, pretty much all the
> non-leaf blocks from the B-trees being used for the search will be
> cached pretty quickly, as well as many commonly used leaf blocks.
> 

this was from querying just after building a fresh index. yeah, i did 
notice significant speedups the more i used it, though i wasn't sure 
what was going on :) this explains it, thanks.

> Are the phrase queries that are slow just a phrase, or combined with
> other terms?  For queries with a phrase and other terms, we could defer
> the phrase check in some cases to reduce how often we actually need to
> do it (there's a bugzilla entry for this).
> 
> There's also a plan to tweak the indexing to eliminate some common cases
> which are currently phrases but don't really need to be (e.g. "e-mail").
> 

these particular tests were pure phrases, e.g. "president johnson".
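
in case it matters, i'm building them roughly like this with the 
python bindings (the terms are just an example):

    import xapian

    # a pure phrase query, e.g. "president johnson"
    query = xapian.Query(xapian.Query.OP_PHRASE, ["president", "johnson"])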

>>the following quote from the documentation has piqued my interest: 
>>"implementing an ultra-compact read-only backend which would take a 
>>quartz b-tree and convert it to something like a cdb hashed file" ... 
>>"If you need serious performance, implementing such a backend is worth 
>>considering.". firstly, would this help with phrase searches, and 
>>secondly, what would be involved in implementing such a beast? i don't 
>>need write access once indexed, so this may be something i'd like to try.
> 
> 
> It would help everything a bit just by reducing the amount of I/O
> required.  A large search system is going to become I/O bound
> eventually, and at that point you win by reducing the amount of I/O and
> improving the useful information in every block cached.
> 
> But first, have you run the database through quartzcompact?  This
> produces Btrees which are fully compact and at revision 1 - they're
> smaller and revision 1 means they're inherently a bit more efficient to
> access.
> 

yep, i ran quartzcompact, which did speed things up a bit, but i never 
did any before-and-after timing tests.
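
fwiw, the invocation i used was along these lines (paths are made up):

    quartzcompact /path/to/db /path/to/db-compact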

> Also relevant are some plans I've been working on for revised encoding
> schemes to replace those currently used by quartz.  Some tests I did
> showed the new positionlist encoding to be ~15% more compact than what
> we currently use, and we can save on the other tables too (more on
> some, less on others), which will all help.
> 

awesome.

> Anyway, as to the question of producing a cdb backend - it isn't a
> huge undertaking, as a lot of quartz can be reused: use the same
> encodings quartz uses, iterate through all the key/tag pairs in each
> Btree (the code in quartzcompact.cc shows how to do this), and put
> these into a cdb (or whatever format).
> 
> Then a bit of glue code is needed to reimplement the bits of the Btree
> external API which are used for retrieval.
> 
> Looking at the cdb API, it's possible cdb isn't actually suitable as we
> need to be able to support a cursor which can move to the next and
> previous keys.  The alternative would be to not chunk the posting lists,
> which is probably going to hurt performance.  Or store all chunks for a
> term with the same key in the cdb (they allow multiple entries with the
> same key), but that may make skipping over chunks expensive as we'd need
> to read each chunk to check if it was the one we wanted...
> 
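
just to check i follow: the cursor is "jump to the first key >= k, 
then step to next/previous", which a plain hash can't answer. a toy 
python sketch of that operation (keys are made up, mimicking chunked 
posting lists keyed by term plus first docid):

    import bisect

    # two made-up chunks of the posting list for "president"
    keys = [b"president\x000000000001", b"president\x000000005000"]

    def cursor_ge(wanted):
        # a btree cursor can seek straight to the first key >= wanted;
        # a sorted array plus bisect emulates that.  cdb, being a pure
        # hash, only supports exact-key lookups.
        return bisect.bisect_left(keys, wanted)

    i = cursor_ge(b"president\x000000004000")  # lands on the second chunk
    # stepping to the next/previous key is then keys[i + 1] / keys[i - 1]
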
> Hmm, I also suspect the licence (which appears to be "You may distribute
> unmodified copies of the cdb package.") isn't GPL compatible.  So a cdb
> backend would be fine for in-house use, but we couldn't distribute the
> cdb library itself with Xapian, and nobody could distribute cdb-enabled
> Xapian packages.
> 
> Also cdb is limited to 4GB per table, which is going to be too
> restrictive for high end users.
> 
> So "something like cdb" might need identifying first!
> 
> I'm not sure quite what the saving would be.  You could calculate the
> exact size of the cdb corresponding to a particular database - just
> iterate over the key/tag pairs, and total up the sizes of each, plus 24
> bytes per pair, plus 2048 bytes per table.  But I've not tried this.
> 
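
that calculation looks easy enough to script - a quick sketch using 
the figures you give (where the key/tag pairs come from is left as a 
placeholder):

    def predicted_cdb_size(pairs):
        # per the figures above: 2048 bytes of fixed pointer table per
        # cdb, plus 24 bytes of overhead per key/tag pair (8-byte record
        # header + 16 bytes of hash-table slots), plus the raw bytes
        total = 2048
        for key, tag in pairs:
            total += 24 + len(key) + len(tag)
        return total
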
> I suspect quartzcompact plus the new encodings will make more difference
> than converting to cdb.  But you could use the new encoding in a cdb
> too, and get the gains from both!
> 

yeah, this sounds like a job for down the road :) i'm already very 
impressed with xapian on my little dataset, but i was wondering what 
kind of possibilities there might be with 10+ million docs when 
something that big comes my way. certainly splitting the indexes up 
will help, as mentioned in the help docs.

> 
>>one other question if i may. how would one get just the record numbers 
>>for the entire estimated MSet? my html layout at the moment is two 
>>frames, a table of contents on the left and search/document on the 
>>right. in order to show (in the toc) which folder/subfolders/documents 
>>have hits, i'm having to issue 'enquire.get_mset(0, 1000)' to shelve the 
>>first 1000 record numbers, then issue another 'enquire.get_mset(0, 10)' 
>>to display the first ten, twenty, etc... of course, the bottleneck is 
>>the first get_mset so removing that step would help wonderfully.
> 
> 
> But get_mset only actually gets the document ids and weights - document
> data etc. is fetched lazily, so if you don't ask for it, you don't incur
> any extra overhead.
> 
> Getting 10 matches is faster not because we're reading less document
> data but because we're able to deploy our shortcutting optimisations
> more often and more effectively.
> 

after issuing get_mset(0, 10), are all the document ids thrown out (or 
unavailable) after the first 10, or are they retrievable via the API? i 
don't need data, values, etc., just the entire set of matching ids.
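
put another way, i'm hoping something like this stays cheap since i 
never touch the document data (a sketch with the python bindings; the 
db path and query are placeholders):

    import xapian

    db = xapian.Database("/path/to/index")
    enquire = xapian.Enquire(db)
    enquire.set_query(xapian.Query(xapian.Query.OP_PHRASE,
                                   ["president", "johnson"]))

    # ask for the whole database up front; only docids and weights are
    # computed, document data is fetched lazily (i.e. not at all here)
    mset = enquire.get_mset(0, db.get_doccount())
    docids = [item.docid for item in mset]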

> 
>>hardware-wise, i'm running a Pentium III (Katmai) @ 551MHz with 512MB 
>>RAM and a 7200rpm disk (hdparm reports 39.45 MB/sec).
> 
> 
> How big are the database tables (i.e. what's the output of ls -l on the
> database directory)?
> 

445MB after quartzcompact.

> Cheers,
>     Olly

thanks much for all your help,
andrew


