[Xapian-discuss] Sanity check on database size

Wed Apr 5 16:58:56 BST 2006

On Wed, Apr 05, 2006 at 10:18:52AM +0200, Jean-Francois Dockes wrote:
> The mail store has around 235 MB of data, with 245 000 messages.  It is
> indexed into a quartz database with xapian 0.9.2.

Is that the compacted size?

> When I dump the termlist database, I find around 1.1 million terms (no
> stemming and a lot of garbage indexed), the size of the dump is
> approximately 25 MB. 'delve' says that the average doc size is 540.

Can you post "ls -l" on the database directory?

> The xapian database is around 2.4 GB before compaction, 1.6 after.
> 
> The termlist_DB file is around 480 MB before compaction, 280 after.
> 
> I tried to copy the db to a flint one (with XAPIAN_PREFER_FLINT=yes
> copydatabase ...), this doesn't seem to make a significant difference.

I'd expect the flint database to be appreciably smaller if you're
indexing with positional information, but perhaps the compaction
doesn't shine with many small documents.

> I know that xapian handles vastly bigger document sets, so I'm a bit
> surprised by how big the database is. In my usual experience on mixed
> sets of documents, I had found the xapian database to be approximately the
> same size as the document set.

Roughly - generally I'd expect the database to be somewhat smaller than
the document set if you're indexing positional information.

The gmane database is currently 59G, which doesn't have positional
information.  Adding positional information usually roughly doubles
the size.  Gmane uses the flint backend.

Incidentally, the flint changes currently in the pipeline should
significantly reduce the size of any database.

> Is the bigger database size due to the small average document size, does
> this look normal, or am I probably doing something weird/wrong?

You perform stemming at search time rather than at index time, so the
index holds unstemmed word forms.  That means more unique terms, which
is likely to expand the size of the database somewhat.

Also, do you put a limit on term size?  Omega's indexers ignore
probabilistic terms longer than 64 characters, since they're usually
junk like uuencoded or base64 data.
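
As a rough illustration of that kind of limit, here is a minimal sketch
in Python.  It is not Omega's code: the function name and the constant
are made up for the example, and it counts Python characters, which may
not match how a given indexer measures term length.

```python
# Hypothetical helper, not part of Omega or the Xapian API.
MAX_TERM_LENGTH = 64  # limit mentioned above for probabilistic terms


def filter_terms(terms, max_length=MAX_TERM_LENGTH):
    """Return only the terms that are at most max_length characters long.

    Very long "words" in mail are usually encoded attachments
    (uuencode, base64) rather than anything a user would search for.
    """
    return [term for term in terms if len(term) <= max_length]
```

You would apply something like this to the tokens of each message
before adding them to the document as terms.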

Otherwise I'm not sure what might be going on - "ls -l" will show which
tables are to blame at least.
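
If you'd rather have the sizes sorted with percentages, something like
the following sketch does roughly the same job as "ls -l".  The function
names are invented for the example and nothing here is Xapian-specific:
it just reports the files found in whatever directory you point it at.

```python
import os


def table_sizes(db_dir):
    """Return (filename, size_in_bytes) pairs for db_dir, largest first."""
    sizes = []
    for entry in os.scandir(db_dir):
        if entry.is_file():
            sizes.append((entry.name, entry.stat().st_size))
    return sorted(sizes, key=lambda item: item[1], reverse=True)


def format_report(sizes):
    """Format each file as 'bytes  percent  name', one string per file."""
    total = sum(size for _, size in sizes) or 1  # avoid dividing by zero
    return [
        f"{size:>14,d}  {100 * size / total:5.1f}%  {name}"
        for name, size in sizes
    ]
```

Printing each line of format_report(table_sizes(path)) for the database
directory shows at a glance which files dominate.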

> Also both my indexing process and the copydatabase one seem to be using
> around 150 MB of memory (memory usage slowly increases up to this
> value). Is this amount a (possibly tunable) constant or, if it's variable,
> by what is it determined?

You can currently control the number of documents handled before an
automatic flush (XAPIAN_FLUSH_THRESHOLD environment variable).
Setting this higher (if you have plenty of memory) leads to faster
indexing, but you need to allow spare memory for the OS to cache disk
blocks to get the best performance.
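
For example, if you launch the indexer from a script, the variable can
be set for just that child process.  This is only a sketch: the helper
name is invented, and the indexer command in the comment is a
placeholder for whatever you actually run.

```python
import os


def indexer_env(flush_threshold, base_env=None):
    """Return a copy of the environment with XAPIAN_FLUSH_THRESHOLD set.

    flush_threshold is the number of documents to handle between
    automatic flushes.  base_env defaults to the current environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["XAPIAN_FLUSH_THRESHOLD"] = str(int(flush_threshold))
    return env


# Usage (hypothetical command and path):
#   import subprocess
#   subprocess.run(["./my_indexer", "/path/to/db"], env=indexer_env(50000))
```

Building a copy keeps the setting out of the parent process, so you can
try different values per run.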

Given the setting of XAPIAN_FLUSH_THRESHOLD, the memory used depends
mostly on the size of the documents being handled (we buffer the posting
lists as we generate them - essentially we build the inverted file in
XAPIAN_FLUSH_THRESHOLD document chunks).
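
A toy model may make the chunking clearer.  The Python below is not how
Xapian is implemented; the class and attribute names are invented, and a
dictionary stands in for the on-disk posting lists.  It only shows why
the buffered memory scales with the threshold and with how many terms
each document contributes.

```python
from collections import defaultdict


class ChunkedIndexer:
    """Toy inverted-file builder that flushes every flush_threshold docs."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold
        self.buffer = defaultdict(list)  # term -> docids not yet flushed
        self.pending_docs = 0
        self.stored = defaultdict(list)  # stand-in for on-disk postings
        self.flush_count = 0

    def add_document(self, docid, terms):
        # Buffer one posting per distinct term in the document.
        for term in set(terms):
            self.buffer[term].append(docid)
        self.pending_docs += 1
        if self.pending_docs >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Merge the buffered postings into the stored lists, then reset.
        if self.pending_docs == 0:
            return
        for term, docids in self.buffer.items():
            self.stored[term].extend(docids)
        self.buffer.clear()
        self.pending_docs = 0
        self.flush_count += 1
```

With a threshold of 2, adding three documents triggers one automatic
flush and leaves the third document's postings in the buffer until the
next flush.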

There's scope for markedly reducing the memory used in the common (and
critical) case where documents are being appended to the end of the 
database (or mostly are).  My plan is to improve this, and to allocate
a block of memory outside the process (using anon mmap on Unix or
GlobalAlloc on Windows) - then we can flush when this gets full, and
we can also release this memory back to the OS once we're done with
it.
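
To give a feel for the idea (this is only a sketch of the plan, not
existing Xapian code), here it is in Python, where mmap.mmap(-1, size)
creates an anonymous mapping.  The class name and methods are invented
for the example.

```python
import mmap


class AnonBuffer:
    """Fixed-size buffer in anonymous mapped memory, outside the heap."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.buf = mmap.mmap(-1, capacity)  # anonymous mapping

    def append(self, data):
        """Copy data into the buffer; return False if it would not fit."""
        if self.used + len(data) > self.capacity:
            return False  # caller should flush (drain) and retry
        self.buf[self.used:self.used + len(data)] = data
        self.used += len(data)
        return True

    def drain(self):
        """Return the buffered bytes and mark the buffer as empty."""
        data = self.buf[:self.used]
        self.used = 0
        return data

    def close(self):
        """Unmap the memory so it goes back to the OS."""
        self.buf.close()
```

The caller appends until append() returns False, writes out what
drain() returns, and calls close() once indexing is finished.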

Cheers,
    Olly