[Xapian-discuss] Flint Backend
olly at survex.com
Sun Jun 26 23:45:39 BST 2005
On Sun, Jun 26, 2005 at 10:43:32AM +0200, Arjen van der Meijden wrote:
> I copydatabase'd it to a new quartz-database to see how large that'd be
> when a non-compact database was regenerated from scratch. I ran
> quartzcompact without zlib-compressing the tables and I ran
> quartzcompact using -n -F parameters to see any gains from that.
It doesn't really make sense to use "-n" and "-F" together. Although
the things they do don't conflict, "-n" is saying that you care more
about being able to update than squeeze every last byte out, whereas
"-F" says quite the opposite. Now quartzcompact will happily turn on
both if told to, while xapian-compact will use whichever is specified
last and ignore the other. That's why "-n -F" and "-F" have exactly
the same result!
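
As a rough illustration of that difference, here is a minimal sketch of the two option-handling styles. The function names and the mode labels ("update-friendly", "smallest") are made up for this example and are not taken from either tool's source; only the behaviour described above is modelled.

```python
def quartzcompact_flags(argv):
    """Model of 'turn on both if told to': each flag is tracked independently.

    Hypothetical helper for illustration only.
    """
    return {"n": "-n" in argv, "F": "-F" in argv}


def xapian_compact_mode(argv):
    """Model of 'whichever is specified last wins': one mode, overwritten in order.

    Hypothetical helper for illustration only.
    """
    mode = "default"
    for arg in argv:
        if arg == "-n":
            mode = "update-friendly"  # made-up label for what -n asks for
        elif arg == "-F":
            mode = "smallest"         # made-up label for what -F asks for
    return mode


if __name__ == "__main__":
    print(quartzcompact_flags(["-n", "-F"]))
    print(xapian_compact_mode(["-n", "-F"]), xapian_compact_mode(["-F"]))
```

With the second style, "-n -F" and "-F" resolve to the same single mode, which matches the identical table sizes reported below.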
> I also copydatabase'd a flint version from it and ran xapian-compact on
> that database, and xapian-compact -n -F and -F (they had exactly the
> same result). The xapian-version I used was thunderday's 0.9.1_svn6307.
> Here are the table-sizes, the original working database on our
> production machine, the quartz copy I made from the compacted version of
> that db and the flint-copy:
>            Qz 0.8.4     Qz ? copy    Flint
> Position   8341782528   7785979904   7456931840
> Postlist   4038926336   3726647296   3726647296
> Record     367075328    407076864    258154496
> Termlist   3506757632   3455180800   1868873728
> Value      92176384     94699520     124583936
Flint currently uses different keys to quartz for everything except the
postlist table, which should result in a smaller size when rebuilding
like this (the keys sort in the same order as the document ids, so we're
always appending in this situation, which should mean the Btree stays in
"sequential addition" mode and packs blocks pretty tightly).
The keys themselves are sometimes one byte longer, but overall this
seems to be a win. I'm unsure why the value table does so much worse
with flint though.
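
To make the "keys sort in the same order as the document ids" point concrete, here is a minimal sketch of one order-preserving key encoding. It is an assumption for illustration, not Flint's actual key format: `encode_docid_key` is a hypothetical helper that emits a length byte followed by the big-endian bytes of the id, so plain byte-wise comparison agrees with numeric order.

```python
def encode_docid_key(docid):
    """Hypothetical order-preserving key: 1 length byte + big-endian payload.

    This is NOT claimed to be Flint's real encoding; it only shows the idea.
    """
    if docid < 0:
        raise ValueError("document ids are non-negative")
    nbytes = max(1, (docid.bit_length() + 7) // 8)
    payload = docid.to_bytes(nbytes, "big")
    return bytes([nbytes]) + payload


def always_appending(docids):
    """True if inserting keys in this docid order never goes before the last key."""
    keys = [encode_docid_key(d) for d in docids]
    return all(a < b for a, b in zip(keys, keys[1:]))


if __name__ == "__main__":
    # Increasing docids give strictly increasing keys, so every insert is an append.
    print(always_appending(range(1, 100000)))
    # A naive decimal-string key does not have this property: b"10" sorts before b"9".
    print(str(10).encode() < str(9).encode())
```

The length byte in this sketch is one way such an encoding can cost an extra byte per key; whether that is what makes the Flint keys longer is not stated here.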
> Here are some results for quartzcompact, the no-options + no-zlib,
> the original compacted database with zlib and the compacted -n -F +
> zlib. Please do note that it is actually larger than the original and
> that the position table is not zlib-compressed:
>            Qz           Qz 084 gz    Qz -nF gz
> Position   7424589824   7424589824   7432200192
> Postlist   1708957696   1428889600   1535426560
> Record     254222336    178831360    179888128
> Termlist   1770250240   1249050624   1395597312
> Value      61317120     53313536     53313536
I think "-nF" is probably larger because of the "-n". Can you try with
> Here are the xapian-compact results of the flint database. Here -n -F and -F
> produced exactly the same table sizes but they were smaller than the
> original compaction-try. Please do note the position-table is larger
> than in the quartz compacted-cases.
>            Flint        Flint -nF/-F
> Position   7452794880   7451574272
> Postlist   1644240896   1634279424
> Record     255377408    254418944
> Termlist   1772339200   1764106240
> Value      62177280     62177280
OK, so comparing against the non-zlib, we're a bit better for postlist,
and a bit worse for record/termlist/value. I suspect that's mostly
down to the longer keys, which will be resolved when I replace the Btree
manager (I'm going to make the key compare a virtual function which can
be different for each table, rather than having to encode the keys in
such a way that the byte contents compare in the desired order).
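
For anyone wanting to check the "a bit better / a bit worse" comparison, here is a small sketch that computes the relative size change per table. The byte counts are copied from the figures quoted in this thread (the non-zlib quartzcompact column and the plain "Flint" xapian-compact column); the function names are just for this example.

```python
# Table sizes in bytes, copied from the figures quoted in this thread.
QUARTZ_NO_ZLIB = {
    "Position": 7424589824,
    "Postlist": 1708957696,
    "Record": 254222336,
    "Termlist": 1770250240,
    "Value": 61317120,
}
FLINT_COMPACT = {
    "Position": 7452794880,
    "Postlist": 1644240896,
    "Record": 255377408,
    "Termlist": 1772339200,
    "Value": 62177280,
}


def percent_change(old, new):
    """Relative change from old to new, in percent (negative means smaller)."""
    return (new - old) * 100.0 / old


def compare(baseline, candidate):
    """Per-table percent change of candidate relative to baseline."""
    return {t: percent_change(baseline[t], candidate[t]) for t in baseline}


if __name__ == "__main__":
    for table, pct in compare(QUARTZ_NO_ZLIB, FLINT_COMPACT).items():
        print("%-9s %+6.2f%%" % (table, pct))
```

Only the postlist table comes out negative (smaller under flint); the others come out slightly positive.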
It's a shame that the new position table encoding isn't smaller for you.
I think I might need to look at your data at some point, but I'll try
some more examples locally first in case it's the one I've been using
which is atypical.
> Did that much change in the way quartzcompaction is done from 0.8.4 to
A few things, but mostly the addition of the ability to merge databases.
Inside the Btree manager itself, the "should I split this item" test
is now more sophisticated. It used to always split if there was space,
but that can actually make the database bigger (because it means that
the dividing keys are likely to be longer and that can easily overwhelm
the few bytes saved by cramming a block completely full).
> Is reading from the working, instead of the compacted database a
Almost certainly - there's probably less to read (though bear in mind
that the working database will have a number of blocks which aren't
in use in the current version and these don't need to be read to copy
it), but more to the point a database which is compact with revision
1 (like those that quartzcompact and xapian-compact produce) is more efficient
to read and iterate over.