[Xapian-discuss] Sanity check on database size

Fri Apr 7 10:03:05 BST 2006

Olly Betts writes:
 > Quartz's structure is described here:
 > 
 > http://www.xapian.org/docs/quartzdesign.html
 > 
 > Flint's is here (though some of this currently describes how it will be
 > I think):
 > 
 > http://wiki.xapian.org/FlintBackend_2fStructure
 > 
 > How the matcher works is here:
 > 
 > http://www.xapian.org/docs/matcherdesign.html
 > 
 > The first and last are linked from the index page of the documentation
 > (in the internals section).

The quartz design document is where I got my suspicion that I may want to
be more careful with my usage of path terms...

I have to admit that the matcher document is beyond my intellectual reach :)

What I think is lacking is a 'xapian internals for dummies' document
describing in high level terms what data is stored in each table and how
the different tables' data is accessed and used in the course of processing
a simple query.

Please understand that I am not requesting things to exist, just sharing my
experience trying to understand how my design decisions would affect
database size and performance. I do understand the issues of limited
time and resources, and I'm already very thankful for what does exist ...

 > We can only easily compress repeated prefixes within a single termlist.
 > Compressing between termlists is much harder because they can change
 > independently of one another and we don't want to have to rewrite one
 > termlist just because we changed another.

Yes, this is easy to understand when you're aware of the existence of the
document termlists. This had somehow slipped out of my mind.

 > > Being a personal tool, the assumption for recoll is that space does not
 > > really matter
 > 
 > It's bad for performance though.  In particular, I suspect many recoll
 > searches will be isolated events, so they'll be searching with little
 > or none of the database cached.
 > 
 > What really helps this scenario is minimising the height of the Btrees
 > since you need to read that many blocks to get to the leaf blocks, which
 > are where the information is actually stored.  The flint changes I'm
 > currently working on should markedly reduce the height of Btrees.

Of course you are right, but I think that you may also be biased by working
with huge systems with many users. If you think of the typical desktop
situation, a single-user workstation with a few hundred megabytes of index
data at most, it becomes very hard to get response times over 2-3 seconds
even with an empty cache, and it's more typical to have subsecond
answers. I'm repeatedly told (by my 3 users:)) how fast people think recoll
(meaning xapian) is, even if I really take few precautions, or make
blunders, in (un)optimizing performance.

For reference, timing for a few requests on the message database which
started this thread, and which is made from the apache message archive:

The machine is an Athlon 2400+, with 256 MB of memory, and an average IDE
disk.

Request for "a the an apache". Stem expansion results into: ((a OR the OR
anly OR ans OR an OR apache OR apached OR apacheers OR apacheing OR
apacheness OR apachent OR apacher OR apachers OR apaches OR apaching))
Response time ~ 3-4 S, right after program start after machine
reboot. Second query is instantaneous. The query seems to think there are
234000 candidate answers.

Request for mod_ssl after reboot: < 1S . 9180 answers.

The downside of such a nice database is that it encourages programmer
lazyness...

 > > [JFD]I think that the relevant parameter to
 > > set is the amount of memory used, not a number of document flush
 > > threshold. [...]
 > 
 > It's pretty hard to estimate well - we just shove stuff into a handful
 > of STL maps.  We could total the lengths of all the strings I suppose
 > and add stuff for the non-strings and overhead.  But I'm intending to
 > rewrite it all to store much more compactly in the append case, and
 > I'm loathe to expend much of my finite time on badly implementing new
 > features in doomed code.

I am obviously in agreement and sympathy with not wasting your time. This
was not a request, just thought sharing.

The thing I am suggesting is that, when working with a mixed document set,
even a very rough (say, off by 50%) memory usage estimation might be better
than a document number threshold. 

This is because the local average document size may easily vary by an order
of magnitude depending on what you are currently indexing, the global
average document size means nothing (this is under the assumption that the
average document size does affect memory usage directly, which I think I
read somewhere, but may not hold true for the future flint implementation?).

The indexer might go over the user's mail archive with an average doc size
of a few hundred bytes at most, then into the misc docs area (programming
manuals, mirrored web sites, ebooks, etc.) where the average size will be
more like a few dozen KBs.

I had never been concerned about these issues until a user decided to try
and run a test over a 3GB mail archive just to see, and did see...

 > What if you then run xapian-compact on the flint database?
 > 
 > (I'd expect no size change except for postlist.DB which should more
 > than halve in size...)

I reran the index build direct to flint as I had erased the data. This
gives:

corbieres$ ls -s xapiandb/
total 1967912
     0 flicklock       524324 postlist.DB          8 termlist.baseB
     4 iamflint             4 record.baseA    319252 termlist.DB
    16 position.baseA       4 record.baseB         4 value.baseA
    16 position.baseB  126632 record.DB            4 value.baseB
997604 position.DB          4 stem_english         0 value.DB
    12 postlist.baseA       4 stem_french
    12 postlist.baseB       8 termlist.baseA

corbieres$ XAPIAN_PREFER_FLINT=yes xapian-compact xapiandb compacted
postlist: Reduced by 55.4264% 290328K (523808K -> 233480K)
record: Reduced by 2.73193% 3456K (126504K -> 123048K)
termlist: Reduced by 5.76417% 18384K (318936K -> 300552K)
position: Reduced by 0.0417409% 416K (996624K -> 996208K)
value: Size unchanged (0K)

corbieres$  ls -s compacted/
total 1654988
     4 iamflint        233712 postlist.DB     300852 termlist.DB
     4 position.baseA       4 record.baseA         4 value.baseA
    16 position.baseB       4 record.baseB         4 value.baseB
997188 position.DB     123176 record.DB            0 value.DB
     4 postlist.baseA       4 termlist.baseA
     4 postlist.baseB       8 termlist.baseB

Regards,
Jean-Francois Dockes