Amount of writes during index creation
Jean-Francois Dockes
jf at dockes.org
Mon Feb 4 18:00:10 GMT 2019
Olly Betts writes:
> On Thu, Jan 31, 2019 at 08:44:44PM +0100, Jean-Francois Dockes wrote:
> > I have run a number of tests, with data mostly from a project
> > gutenberg dvd and other books, with relatively modest index sizes,
> > from 1 to 24 GB.
> >
> > Quite curiously, in this zone, with all Xapian versions I tried, the
> > amount of writes is roughly proportional to the index size to the
> > power 1.5:
> >
> > TotalWrites / (IndexSize**1.5) ~= K
>
> I could perhaps believe it would tend to O(n*log(n)) eventually due to
> the number of levels in the B-tree being log(n) (though the number of
> levels is bounded above by a fairly small constant so one could
> argue that's O(n)).
>
> But probably the merging on commit will actually determine the O()
> behaviour, and that's harder to determine theoretically.
The 1.5 exponent is indeed frankly bizarre, but it holds rather well for
index sizes from 1.5 to 24 GB in this configuration... Just a curiosity.
> >     size     writes     K  writes/size
> >
> >  1402524    1597352  0.96         1.14
> >  2223076    3291588  0.99         1.48
> >  2678404    4121024  0.94         1.54
> >  3842372    7219404  0.96         1.88
> >  4964132   10850844  0.98         2.19
> >  6062204   14751196  0.99         2.43
> > 19677680  125418760  1.44         6.37
> > 24349248  166162068  1.38         6.82
> >
> > The amount of writes is estimated with iostat before/after. The disk has
> > nothing else to do.
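
The K and writes/size columns of the table above can be recomputed from the first two columns. What follows is a minimal Python sketch, not part of the original measurements. It assumes that the K column is scaled by 1000, which is what reproduces the listed values, and the function names are made up for illustration. It also estimates the exponent with a least-squares fit on a log-log scale.

```python
import math

# (index size, total writes) pairs copied from the table above.
ROWS = [
    (1402524, 1597352),
    (2223076, 3291588),
    (2678404, 4121024),
    (3842372, 7219404),
    (4964132, 10850844),
    (6062204, 14751196),
    (19677680, 125418760),
    (24349248, 166162068),
]


def ratios(size, writes):
    """Return (K, writes/size) for one row.

    K = writes / size**1.5, multiplied by 1000 (assumed scaling).
    """
    k = 1000.0 * writes / size ** 1.5
    return k, writes / size


def fitted_exponent(rows):
    """Least-squares slope of log(writes) against log(size)."""
    xs = [math.log(size) for size, _ in rows]
    ys = [math.log(writes) for _, writes in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    for size, writes in ROWS:
        k, per_size = ratios(size, writes)
        print(f"{size:>10} {writes:>10} {k:5.2f} {per_size:5.2f}")
    # Fit only the six smaller indexes, where K stays close to 1.
    print(f"fitted exponent: {fitted_exponent(ROWS[:6]):.2f}")
```

Fitting only the first six rows keeps the estimate within the range where K is nearly constant; the two largest indexes visibly depart from it.
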
>
> There's a script in git which allows more precise I/O analysis by
> logging relevant I/O using strace:
>
> xapian-maintainer-tools/profiling/strace-analyse
>
> Using strace means other processes are definitely excluded and you get
> to see which tables (and even which blocks) the I/O is, e.g. a small
> update to a small database gives:
>
> [...]
I tried to use strace -c, but for some reason, the pwrite counts in the
results were erratic (sometimes getting something like 11 writes after
indexing), probably some issue with my script, so I did not use them.
The output was to a backup disk, with no other activity during the tests.
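
For readers who want to repeat the before/after measurement without iostat, the same counter can be read from /proc/diskstats on Linux. This is only a sketch of an alternative, not what was run for the numbers above. It assumes Linux, relies on the kernel reporting sectors written in 512-byte units in the tenth field, and uses hypothetical function names and a hypothetical device name.

```python
def sectors_written(diskstats_text, device):
    """Return the 'sectors written' counter for one device.

    Each /proc/diskstats line is: major minor name, then the I/O
    counters; sectors written is the tenth whitespace-separated field.
    """
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) > 9 and fields[2] == device:
            return int(fields[9])
    raise KeyError(device)


def written_kib(before_text, after_text, device):
    """KiB written to the device between two snapshots."""
    delta = sectors_written(after_text, device) - sectors_written(before_text, device)
    return delta * 512 // 1024  # diskstats sectors are 512 bytes


def read_diskstats():
    """Take a snapshot of the kernel counters (Linux only)."""
    with open("/proc/diskstats") as f:
        return f.read()


if __name__ == "__main__":
    device = "sdb"  # hypothetical name of the otherwise idle index disk
    before = read_diskstats()
    input("Run the indexing job, then press Enter... ")
    after = read_diskstats()
    print(written_kib(before, after, device), "KiB written")
```

As with iostat, this counts every write to the device, so the disk must have nothing else to do during the test.
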
> If you're going to the trouble of profiling, probably best to use the
> latest release (1.4.5 was released in 2017).
I was trying an older release to see if something had changed for the worse
recently.
> > xapian git master latest idxflushmb 200
> > xapian git master before patch idxflushmb 200
> There are other changes between RELEASE/1.4 and master which will
> likely improve indexing speed and memory use, but I'm not sure
> there's anything which would affect disk writes (unless we end up
> swapping to disk with 1.4 but master avoids doing so due to lower memory
> usage).
Oops, sorry, the lines above should have read RELEASE/1.4, not master. Only
the later test with a small flush interval was done with master (by mistake).
Definitely no swapping to this disk.
> > The improvement brought by the patch is nice. It remains that for
> > people using big indexes on SSD, the amount of writes is still
> > something to consider, and splitting the index probably makes sense?
> > What do you think?
>
> If you want to build a very large DB it's almost certain to be faster to
> build it as a series of smaller DBs and merge them.
Thanks for the confirmation, this is what the reporting user has concluded,
I'll confirm to them that it is the right approach.
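
As an illustration of the merge step, the smaller databases can be combined with the xapian-compact tool, which accepts several source databases followed by a destination. The sketch below only builds and runs that command line from Python. The helper names and paths are hypothetical, and how the partial indexes are produced is left to the indexer.

```python
import subprocess


def compact_command(shard_paths, dest_path):
    """Build the argv for: xapian-compact SOURCE... DESTINATION."""
    if not shard_paths:
        raise ValueError("at least one source database is required")
    return ["xapian-compact", *shard_paths, dest_path]


def merge_shards(shard_paths, dest_path):
    """Run xapian-compact; raises CalledProcessError if it fails."""
    subprocess.run(compact_command(shard_paths, dest_path), check=True)


if __name__ == "__main__":
    # Hypothetical paths: three partial indexes built separately.
    merge_shards(["part1.db", "part2.db", "part3.db"], "merged.db")
```
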
> At least with the current backends (glass and older) - the plan for the
> next backend (honey) is that it'll actually behave like that behind the
> scenes, but that part isn't fully written yet.
I am sure that people with big indexes will appreciate it!
Cheers,
jf