Amount of writes during index creation
Olly Betts
olly at survex.com
Sun Feb 3 22:32:27 GMT 2019
On Thu, Jan 31, 2019 at 08:44:44PM +0100, Jean-Francois Dockes wrote:
> I have run a number of tests, with data mostly from a project gutenberg dvd
> and other books, with relatively modest index sizes, from 1 to 24 GB.
>
> Quite curiously, in this zone, with all Xapian versions I tried, the
> amount of writes is roughly proportional to the index size to the
> power 1.5:
>
> TotalWrites / (IndexSize**1.5) ~= K
I could perhaps believe it would tend to O(n*log(n)) eventually due to
the number of levels in the B-tree being log(n) (though the number of
levels is bounded above by a fairly small constant so one could
argue that's O(n)).
But probably the merging on commit will actually determine the O()
behaviour, and that's harder to determine theoretically.
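Since the theory is hard, one option is to estimate the exponent empirically with a least-squares fit on log-log axes. A minimal sketch, assuming (index size, total writes) pairs in kB like the "git master latest" figures quoted further down; fit_power_law is a hypothetical helper name, not anything from Xapian:

```python
import math

def fit_power_law(points):
    """Fit writes ~= c * size**p by least squares on log-log axes.

    points is a list of (size, writes) pairs; returns (p, c).
    """
    xs = [math.log(size) for size, _ in points]
    ys = [math.log(writes) for _, writes in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    p = sxy / sxx                      # slope of the log-log line = exponent
    c = math.exp(mean_y - p * mean_x)  # intercept gives the constant factor
    return p, c

# (xapiandb kB, writes kB) from "xapian git master latest idxflushmb 200"
MASTER_LATEST = [
    (1402524, 1597352),
    (2223076, 3291588),
    (2678404, 4121024),
    (3842372, 7219404),
    (4964132, 10850844),
    (6062204, 14751196),
    (19677680, 125418760),
]

if __name__ == "__main__":
    p, c = fit_power_law(MASTER_LATEST)
    print(f"fitted exponent: {p:.2f}, constant: {c:.3g}")
```

A fitted exponent close to 1 would be consistent with the O(n) argument; a value well above 1 over this size range would suggest the merging on commit dominates, at least for these index sizes.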
> The amount of writes is estimated with iostat before/after. The disk has
> nothing else to do.
There's a script in git which allows more precise I/O analysis by
logging relevant I/O using strace:
xapian-maintainer-tools/profiling/strace-analyse
Using strace means other processes are definitely excluded and you get
to see which tables (and even which blocks) the I/O touches. For example,
a small update to a small database gives:
read 0 from tmp.db/record.DB
read 0 from tmp.db/termlist.DB
read 0 from tmp.db/position.DB
read 0 from tmp.db/postlist.DB
write 1 to tmp.db/postlist.DB
write 1 to tmp.db/position.DB
write 1 to tmp.db/termlist.DB
write 1 to tmp.db/record.DB
sync tmp.db/postlist.tmp
sync tmp.db/postlist.DB
read 1 from tmp.db/postlist.DB
sync tmp.db/position.tmp
sync tmp.db/position.DB
read 1 from tmp.db/position.DB
sync tmp.db/termlist.tmp
sync tmp.db/termlist.DB
read 1 from tmp.db/termlist.DB
sync tmp.db/record.tmp
sync tmp.db/record.DB
read 1 from tmp.db/record.DB
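Output like the above is easy to summarise per table. A minimal sketch, assuming the lines have exactly the shape shown in the sample ("read N from PATH", "write N to PATH", "sync PATH"); the script's real output may differ in detail, and tally is a hypothetical helper rather than part of strace-analyse:

```python
from collections import Counter

# A few lines copied from the sample output above.
SAMPLE = """\
read 0 from tmp.db/postlist.DB
write 1 to tmp.db/postlist.DB
sync tmp.db/postlist.tmp
sync tmp.db/postlist.DB
read 1 from tmp.db/postlist.DB
"""

def tally(lines):
    """Count operations per (operation, file) pair.

    Assumes the first field is the operation and the last field is the
    file path, as in the sample; other lines are ignored.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        op, path = fields[0], fields[-1]
        if op in ("read", "write", "sync"):
            counts[(op, path)] += 1
    return counts

if __name__ == "__main__":
    for (op, path), n in sorted(tally(SAMPLE.splitlines()).items()):
        print(f"{op:5} {path:25} {n}")
```

Multiplying the read and write counts by the block size of the table would give an approximate byte total per table, which is the figure to compare against the iostat numbers.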
> idxflushmb is the number of megabytes of input text between Xapian commits.
>
> xapiandb,kb writes,kb K*1000 w/sz
>
> xapian 1.4.5 idxflushmb 200
If you're going to the trouble of profiling, probably best to use the
latest release (1.4.5 was released in 2017).
> 1544724 6941286 3.62 4.49
> 3080540 16312960 3.02 5.30
> 4606060 21054756 2.13 4.57
> 6123140 33914344 2.24 5.54
> 7631788 50452348 2.39 6.61
>
> xapian git master latest idxflushmb 200
>
> 1402524 1597352 0.96 1.14
> 2223076 3291588 0.99 1.48
> 2678404 4121024 0.94 1.54
> 3842372 7219404 0.96 1.88
> 4964132 10850844 0.98 2.19
> 6062204 14751196 0.99 2.43
> 19677680 125418760 1.44 6.37
>
> xapian git master before patch idxflushmb 200
>
> 24707840 750228444 6.11 30.36
OK, so the patch makes a very significant difference here.
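To make the comparison concrete, the two derived columns can be recomputed from the raw figures. A small sketch using only the numbers quoted above; the function name is hypothetical, and the last column is taken to be writes divided by index size, which is what the quoted values work out to:

```python
def derived(size_kb, writes_kb):
    """Return (K*1000, writes/size) for one row of the table.

    K is defined as in the quoted formula: TotalWrites / IndexSize**1.5.
    """
    k_times_1000 = 1000 * writes_kb / size_kb ** 1.5
    ratio = writes_kb / size_kb
    return k_times_1000, ratio

# (xapiandb kB, writes kB), largest run for each build of git master
ROWS = {
    "master, with patch": (19677680, 125418760),
    "master, before patch": (24707840, 750228444),
}

if __name__ == "__main__":
    for label, (size_kb, writes_kb) in ROWS.items():
        k, ratio = derived(size_kb, writes_kb)
        print(f"{label:22} K*1000={k:.2f} writes/size={ratio:.2f}")
```

Note the two runs ended at somewhat different index sizes (about 19.7 GB and 24.7 GB), so this isn't an exact like-for-like comparison, though the gap is far too large for that to explain it.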
There are other changes between RELEASE/1.4 and master which will
likely improve indexing speed and memory use, but I'm not sure
there's anything which would affect disk writes (unless we end up
swapping to disk with 1.4 but master avoids doing so due to lower memory
usage).
> The improvement brought by the patch is nice. It remains that for people
> using big indexes on SSD, the amount of writes is still something to
> consider, and splitting the index probably makes sense ? What do you think ?
If you want to build a very large DB it's almost certain to be faster to
build it as a series of smaller DBs and merge them.
At least with the current backends (glass and older) - the plan for the
next backend (honey) is that it'll actually behave like that behind the
scenes, but that part isn't fully written yet.
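As a rough illustration of the build-in-pieces approach, here is a sketch which splits the input into batches (one per smaller DB) and assembles an xapian-compact command line to merge the results; xapian-compact takes one or more source databases followed by the destination. The batch size, paths and helper names are hypothetical, and the indexing of each batch is left out:

```python
import subprocess

def batches(items, size):
    """Yield consecutive slices of items, each at most size long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def merge_command(shard_paths, dest):
    """Build the argv for merging shard_paths into dest with xapian-compact."""
    return ["xapian-compact", *shard_paths, dest]

def merge(shard_paths, dest):
    """Run the merge; raises CalledProcessError if xapian-compact fails."""
    subprocess.run(merge_command(shard_paths, dest), check=True)

if __name__ == "__main__":
    files = [f"book{i}.txt" for i in range(10)]  # hypothetical input files
    shard_paths = []
    for n, batch in enumerate(batches(files, 4)):
        path = f"shard{n}.db"
        # Index `batch` into a fresh database at `path` here.
        shard_paths.append(path)
    print(" ".join(merge_command(shard_paths, "merged.db")))
```

How large to make each piece is something to determine by experiment; the write figures quoted above suggest the cost per added document grows with the size of the database being updated.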
Cheers,
Olly