[Xapian-discuss] Flint Backend

Olly Betts olly at survex.com
Thu Jun 23 13:58:21 BST 2005


On Thu, Jun 23, 2005 at 08:25:31AM +0200, Arjen van der Meijden wrote:
> The quartzcompact doesn't do that much for the position-table: it goes 
> from 7.8GB (a db that has been in use for quite some time now) to 
> 7.0GB, which is about 11% (actually more than I thought it would be).
> Of course I can't tell how much of that is overhead accumulated from 
> long use and how much is actual compaction gain.

You can use "quartzcompact -n" to compact without doing the tag
splitting that fills blocks fuller (and "quartzcompact -F" to generate
larger than normal tag chunks and reduce the size further, though I'd
not recommend that if you plan to update the compacted database again).

The difference between "quartzcompact -n" and "quartzcompact" (or the
extra gain from running "quartzcompact" after "quartzcompact -n") is
probably what you're thinking of as the "actual compaction-gain".
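
If you want to measure that yourself, something like this should do it
(the output directory names below are just made up for the example):

quartzcompact -n <qdir> <qdir>-n
quartzcompact <qdir> <qdir>-full
du -sk <qdir>-n <qdir>-full

Running "ls -l" inside each directory shows the per-table breakdown.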

> >I'm certainly interested to hear results of converting real-world
> >databases to flint (especially on positionlist table size).  You can
> >do this like so (assuming sh, bash, zsh or similar):
> >
> >XAPIAN_PREFER_FLINT=1 XAPIAN_FLUSH_THRESHOLD=1000000 copydatabase <qdir> <fdir>
> >
> >Where <qdir> is the existing Quartz database and <fdir> is the directory
> >to create the flint database in.
> 
> Will this give usable figures if I use the current flint backend, or 
> are the bugs you found such that the size of the index in particular 
> is negatively influenced?

With 0.9.1, you can't open a flint index for reading.  Also the
positionlist packing missed out some information necessary to actually
unpack the list again, so the size will be slightly underestimated if
anything.

If you want to try flint, it's probably best to use a snapshot from SVN.
This also has the new "xapian-compact" which is like quartzcompact but
for flint databases.
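
Like quartzcompact, it takes the source database(s) followed by a
directory to write the compacted database to, e.g. (the directory names
here are just placeholders):

xapian-compact <fdir> <fdir>-compact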

Incidentally, the gmane search is now running on flint:

http://rain.gmane.org/

That's 26,019,772 documents.  It takes less than 2 days to index using
a strategy of creating databases with around 1,000,000 documents in
each and then merging them using xapian-compact on pairs or triples
(which experimentation shows is faster) until there's one big database
(the indexing time dominates; the merge is just a few hours).
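
As a rough sketch of that merging step (the chunk database names are
invented for the example):

xapian-compact chunk01 chunk02 merged-a
xapian-compact chunk03 chunk04 chunk05 merged-b
xapian-compact merged-a merged-b big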

> >Reduce 1000000 if you don't have loads of memory.  If this number is
> >more than the number of documents, you'll get something roughly
> >equivalent to what "flintcompact -n" would give, if flintcompact
> >existed!

I've now written "flintcompact" (but called it "xapian-compact" with
an eye to the future!)

> With 1000000 it'd try to store _all_ data for the 1M documents in 
> memory before actually flushing it to disk?

Well, yes and no, depending what you mean.

When indexing, the data in all the tables apart from the postlist table
is written to disk right away - only postlist data is buffered (because
we need to "invert" it: we feed in a list of terms for each document
and want a list of documents for each term out).  So it'll buffer all
the postlist data, and nowhere near as compactly as it is stored on
disk (because it needs to be efficiently modifiable).
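
As a loose analogy only (this isn't how the buffering is actually
implemented), the inversion is essentially a re-sort from document
order into term order:

# a hypothetical file of "docid term" pairs, one per line, in document order
sort -k2,2 -k1,1n doc-terms.txt > term-docs.txt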

> We have about 1M documents indeed, but that takes up much more than 
> the 4GB of memory the production machine has, I guess.  You can see 
> above what size our position-table is.  Development machines here 
> 'only' have 1GB.

You probably don't want to use XAPIAN_FLUSH_THRESHOLD=1000000 then,
especially as your documents are large.  Hopefully I can make this
parameter self-tuning (and also greatly reduce the space needed for
buffering).
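
So something like the earlier copydatabase command, but with a smaller
threshold, should be safer on a 1GB machine (100000 below is just an
illustrative figure, not a tuned value):

XAPIAN_PREFER_FLINT=1 XAPIAN_FLUSH_THRESHOLD=100000 copydatabase <qdir> <fdir>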

Cheers,
    Olly


