[Xapian-discuss] Sanity check on database size

Fri Apr 7 02:24:18 BST 2006

On Thu, Apr 06, 2006 at 11:32:24AM +0200, Jean-Francois Dockes wrote:
> By the way, in the course of my 'investigation', I looked for a document
> with at least a rough description of the contents and organisation of the
> database tables, and how they are used during a query, but, if it does
> exist, there doesn't seem to be an obvious pointer to it. Such a document
> would be extremely useful to understand what one is doing while using the
> API.

Quartz's structure is described here:

http://www.xapian.org/docs/quartzdesign.html

Flint's is here (though I think some of this currently describes how it
will be rather than how it is now):

http://wiki.xapian.org/FlintBackend_2fStructure

How the matcher works is here:

http://www.xapian.org/docs/matcherdesign.html

The first and last are linked from the index page of the documentation
(in the internals section).

> For example I had made an assumption that the size of the file path unique
> terms that I'm using to identify documents did not matter much because prefix
> compression was going to be extremely efficient on them.

We can only easily compress repeated prefixes within a single termlist.
Compressing between termlists is much harder because they can change
independently of one another and we don't want to have to rewrite one
termlist just because we changed another.
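
To illustrate why a long unique path term doesn't benefit much, here is a
minimal sketch of prefix compression over one sorted termlist.  This is
not the actual on-disk format of Quartz or Flint; the function names and
the "Q" prefix on the path term are just assumptions for the example.

```python
def prefix_compress(sorted_terms):
    """Encode each term as (shared_prefix_len, suffix) relative to the previous term."""
    entries = []
    prev = ""
    for term in sorted_terms:
        n = 0
        limit = min(len(prev), len(term))
        # Count how many leading characters this term shares with the previous one.
        while n < limit and prev[n] == term[n]:
            n += 1
        entries.append((n, term[n:]))
        prev = term
    return entries


def prefix_decompress(entries):
    """Rebuild the original sorted term list from (shared_prefix_len, suffix) pairs."""
    terms = []
    prev = ""
    for n, suffix in entries:
        term = prev[:n] + suffix
        terms.append(term)
        prev = term
    return terms


# One document's termlist: a single (hypothetical) path term plus ordinary terms.
termlist = ["Q/home/jf/docs/report.txt", "report", "reported", "reporting"]
compressed = prefix_compress(termlist)
```

In this sketch "reported" and "reporting" each share six characters with
their predecessor, but the path term has no neighbour in the same
termlist to share a prefix with, so it is stored in full.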

> Olly Betts writes:
>  > Roughly - generally I'd expect the database to be somewhat smaller than
>  > the document set if you're indexing positional information.
> 
> Being a personal tool, the assumption for recoll is that space does not
> really matter

It's bad for performance though.  In particular, I suspect many recoll
searches will be isolated events, so they'll be searching with little
or none of the database cached.

What really helps this scenario is minimising the height of the Btrees,
since a lookup has to read one block per level to get down to the leaf
blocks, which are where the information is actually stored.  The flint
changes I'm currently working on should markedly reduce the height of
Btrees.
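
As a rough illustration of how the number of block reads depends on how
much fits in a block, here is a sketch for an idealised, completely full
B-tree.  The entry counts and fan-out figures are made up for the
example and are not measurements of Quartz or Flint.

```python
def btree_levels(num_entries, entries_per_leaf, fanout):
    """Levels from root to leaf (inclusive) in an idealised full B-tree.

    With nothing cached, a lookup reads one block per level.
    """
    # Ceiling division: number of leaf blocks needed to hold all entries.
    blocks = max(1, -(-num_entries // entries_per_leaf))
    levels = 1
    while blocks > 1:
        # Each level above needs one pointer per block in the level below.
        blocks = -(-blocks // fanout)
        levels += 1
    return levels


# Hypothetical numbers: the same million entries, stored more or less compactly.
loose = btree_levels(1_000_000, entries_per_leaf=50, fanout=50)
compact = btree_levels(1_000_000, entries_per_leaf=100, fanout=100)
```

With these assumed figures, packing twice as many entries into each block
takes the tree from four levels to three, i.e. one fewer block read for
every uncached lookup.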

>  > Also, do you put a limit on term size?  Omega's indexers ignore
>  > probabilistic terms longer than 64 characters, since they're usually
>  > junk like uuencoded or base64 data.
> 
> Yes, the term size limit is 40 characters. This is probably a bit low,
> but I just can't imagine a user typing a search term longer than 40
> characters :)

They might cut-and-paste a query, but 64 is a pretty arbitrary limit and
40 is equally reasonable.
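
A length cut-off like this amounts to a simple filter at indexing time.
The sketch below is only an illustration; the function name and the
default are assumptions, not code from Omega or recoll.

```python
MAX_TERM_LEN = 64  # arbitrary cut-off; 40 would do just as well


def filter_terms(terms, max_len=MAX_TERM_LEN):
    """Drop empty terms and terms longer than max_len characters.

    Very long "words" are usually junk such as uuencoded or base64 data.
    """
    return [t for t in terms if 0 < len(t) <= max_len]


words = ["xapian", "database", "QmFzZTY0" * 10, ""]
kept = filter_terms(words)
```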

> Actually, from a user point of view, I think that the relevant parameter to
> set is the amount of memory used, not a number of document flush
> threshold. Wouldn't it be possible for xapian to maintain a very rough
> estimate of memory used during indexing, and flush when it exceeds a set
> threshold, independently of the number of documents indexed?

It's pretty hard to estimate well - we just shove stuff into a handful
of STL maps.  We could total the lengths of all the strings I suppose
and add stuff for the non-strings and overhead.  But I'm intending to
rewrite it all to store much more compactly in the append case, and
I'm loath to expend much of my finite time on badly implementing new
features in doomed code.
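
For what it's worth, the crude "total up the string lengths and add
something for overhead" approach could look roughly like the sketch
below.  It is not Xapian code: the class, the callback and both overhead
constants are hypothetical, and the constants are guesses rather than
measured values.

```python
class RoughMemoryEstimator:
    """Very rough running total of bytes buffered since the last flush."""

    def __init__(self, threshold_bytes, flush, per_string_overhead=32, per_int_bytes=8):
        self.threshold_bytes = threshold_bytes
        self.flush = flush  # callback supplied by the indexer (assumed)
        self.per_string_overhead = per_string_overhead  # guessed map/string overhead
        self.per_int_bytes = per_int_bytes  # guessed cost of one stored position
        self.pending_bytes = 0

    def add_document(self, term_positions):
        """Account for one document; term_positions maps term -> list of positions.

        Returns True if this document pushed the estimate over the threshold
        and a flush was triggered.
        """
        for term, positions in term_positions.items():
            self.pending_bytes += len(term.encode("utf-8")) + self.per_string_overhead
            self.pending_bytes += len(positions) * self.per_int_bytes
        if self.pending_bytes >= self.threshold_bytes:
            self.flush()
            self.pending_bytes = 0
            return True
        return False
```

The indexer would call add_document() for each document and pass its own
flush routine as the callback, so the threshold is expressed in bytes
rather than in documents.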

> corbieres$ XAPIAN_PREFER_FLINT=yes copydatabase xapiandb flint
> corbieres$ ls -s flint/
> total 1967908
>      0 flicklock           12 postlist.baseB       8 termlist.baseB
>      4 iamflint        524332 postlist.DB     319252 termlist.DB
>     16 position.baseA       4 record.baseA         4 value.baseA
>     16 position.baseB       4 record.baseB         4 value.baseB
> 997604 position.DB     126632 record.DB            0 value.DB
>      8 postlist.baseA       8 termlist.baseA

What if you then run xapian-compact on the flint database?

(I'd expect no size change except for postlist.DB which should more
than halve in size...)
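
To put numbers on that, here is a small sketch using the block counts
from the ls -s listing above.  The projection simply assumes postlist.DB
exactly halves and nothing else changes, so it is an upper bound on what
I'd expect, not a measurement.

```python
# Block counts for the .DB files, copied from the "ls -s flint/" output above.
db_blocks = {
    "position.DB": 997604,
    "postlist.DB": 524332,
    "termlist.DB": 319252,
    "record.DB": 126632,
    "value.DB": 0,
}
total_blocks = 1967908  # the "total" line from ls -s (includes the small base files)


def share(name):
    """Fraction of the whole database directory taken up by one table."""
    return db_blocks[name] / total_blocks


# Assumption: compaction halves postlist.DB and leaves everything else alone.
projected_total = total_blocks - db_blocks["postlist.DB"] // 2

for name in sorted(db_blocks, key=db_blocks.get, reverse=True):
    print(f"{name:12s} {share(name):6.1%}")
print("projected total after compaction:", projected_total)
```

On these figures position.DB is about half of the database, so halving
postlist.DB alone would trim the total by roughly 13%.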

Cheers,
    Olly