Amount of writes during index creation

Olly Betts olly at survex.com
Sun Feb 3 22:32:27 GMT 2019


On Thu, Jan 31, 2019 at 08:44:44PM +0100, Jean-Francois Dockes wrote:
> I have run a number of tests, with data mostly from a project gutenberg dvd
> and other books, with relatively modest index sizes, from 1 to 24 GB.
> 
> Quite curiously, in this zone, with all Xapian versions I tried, the
> amount of writes is roughly proportional to the index size to the
> power 1.5:
> 
> TotalWrites / (IndexSize**1.5) ~= K

I could perhaps believe it would tend to O(n*log(n)) eventually due to
the number of levels in the B-tree being log(n) (though the number of
levels is bounded above by a fairly small constant so one could
argue that's O(n)).

But probably the merging on commit will actually determine the O()
behaviour, and that's harder to determine theoretically.

> The amount of writes is estimated with iostat before/after. The disk has
> nothing else to do.

There's a script in git which allows more precise I/O analysis by
logging relevant I/O using strace:

xapian-maintainer-tools/profiling/strace-analyse

Using strace means other processes are definitely excluded and you get
to see which tables (and even which blocks) the I/O is for. E.g. a small
update to a small database gives:

read 0 from tmp.db/record.DB
read 0 from tmp.db/termlist.DB
read 0 from tmp.db/position.DB
read 0 from tmp.db/postlist.DB
write 1 to tmp.db/postlist.DB
write 1 to tmp.db/position.DB
write 1 to tmp.db/termlist.DB
write 1 to tmp.db/record.DB
sync tmp.db/postlist.tmp
sync tmp.db/postlist.DB
read 1 from tmp.db/postlist.DB
sync tmp.db/position.tmp
sync tmp.db/position.DB
read 1 from tmp.db/position.DB
sync tmp.db/termlist.tmp
sync tmp.db/termlist.DB
read 1 from tmp.db/termlist.DB
sync tmp.db/record.tmp
sync tmp.db/record.DB
read 1 from tmp.db/record.DB
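
A log in this shape is easy to summarise per table. The following is a minimal Python sketch, not part of strace-analyse itself: it assumes every line looks like the sample above (operation first, file path last), and the names SAMPLE and tally are made up for illustration. Real output from the script may contain other line shapes.

```python
from collections import Counter

# Sample lines copied from the output quoted above.
SAMPLE = """\
read 0 from tmp.db/record.DB
read 0 from tmp.db/termlist.DB
read 0 from tmp.db/position.DB
read 0 from tmp.db/postlist.DB
write 1 to tmp.db/postlist.DB
write 1 to tmp.db/position.DB
write 1 to tmp.db/termlist.DB
write 1 to tmp.db/record.DB
sync tmp.db/postlist.tmp
sync tmp.db/postlist.DB
read 1 from tmp.db/postlist.DB
sync tmp.db/position.tmp
sync tmp.db/position.DB
read 1 from tmp.db/position.DB
sync tmp.db/termlist.tmp
sync tmp.db/termlist.DB
read 1 from tmp.db/termlist.DB
sync tmp.db/record.tmp
sync tmp.db/record.DB
read 1 from tmp.db/record.DB
"""


def tally(lines):
    """Count operations per (path, operation) pair.

    Assumption: the first field is the operation (read/write/sync)
    and the last field is the file path, as in the sample above.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        op, path = fields[0], fields[-1]
        counts[(path, op)] += 1
    return counts


if __name__ == "__main__":
    for (path, op), n in sorted(tally(SAMPLE.splitlines()).items()):
        print(f"{path:24} {op:5} {n}")
```

Run on a log from a full indexing job, the same tally would show which table accounts for most of the writes.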

> idxflushmb is the number of megabytes of input text between Xapian commits.
> 
> xapiandb,kb	writes,kb	K*1000	w/sz
> 
> xapian 1.4.5 idxflushmb 200

If you're going to the trouble of profiling, probably best to use the
latest release (1.4.5 was released in 2017).

> 1544724		6941286		3.62	4.49	
> 3080540		16312960	3.02	5.30	
> 4606060		21054756	2.13	4.57	
> 6123140		33914344	2.24	5.54	
> 7631788		50452348	2.39	6.61	
> 				
> xapian git master latest idxflushmb 200				
> 
> 1402524		1597352 	0.96	1.14	
> 2223076		3291588 	0.99	1.48
> 2678404		4121024 	0.94	1.54	
> 3842372		7219404		0.96	1.88	
> 4964132		10850844	0.98	2.19	
> 6062204		14751196	0.99	2.43	
> 19677680	125418760	1.44	6.37
> 				
> xapian git master before patch idxflushmb 200				
> 
> 24707840	750228444	6.11	30.36	

OK, so the patch makes a very significant difference here.
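
For anyone wanting to check the derived columns, they can be recomputed from the first two. The following is a small Python sketch using only the figures quoted above. The function names are invented for illustration, and it assumes the third column is 1000 * writes / size**1.5 and the last column is writes divided by index size, which is what the quoted numbers match.

```python
# (index size in KB, total writes in KB) pairs copied from the quoted table.
RUNS = {
    "xapian 1.4.5": [
        (1544724, 6941286),
        (3080540, 16312960),
        (4606060, 21054756),
        (6123140, 33914344),
        (7631788, 50452348),
    ],
    "git master latest": [
        (1402524, 1597352),
        (2223076, 3291588),
        (2678404, 4121024),
        (3842372, 7219404),
        (4964132, 10850844),
        (6062204, 14751196),
        (19677680, 125418760),
    ],
    "git master before patch": [
        (24707840, 750228444),
    ],
}


def k_times_1000(size_kb, writes_kb):
    """K*1000 where K = TotalWrites / IndexSize**1.5."""
    return 1000.0 * writes_kb / size_kb ** 1.5


def write_ratio(size_kb, writes_kb):
    """KB written per KB of final index."""
    return writes_kb / size_kb


if __name__ == "__main__":
    for label, rows in RUNS.items():
        print(label)
        for size_kb, writes_kb in rows:
            k = k_times_1000(size_kb, writes_kb)
            r = write_ratio(size_kb, writes_kb)
            print(f"  {size_kb:>10} {writes_kb:>11} {k:6.2f} {r:6.2f}")
```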

There are other changes between RELEASE/1.4 and master which will
likely improve indexing speed and memory use, but I'm not sure
there's anything which would affect disk writes (unless we end up
swapping to disk with 1.4 but master avoids doing so due to lower memory
usage).

> The improvement brought by the patch is nice. It remains that for people
> using big indexes on SSD, the amount of writes is still something to
> consider, and splitting the index probably makes sense ? What do you think ?

If you want to build a very large DB it's almost certain to be faster to
build it as a series of smaller DBs and merge them.

At least with the current backends (glass and older) - the plan for the
next backend (honey) is that it'll actually behave like that behind the
scenes, but that part isn't fully written yet.
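
One way to do the merge step is the xapian-compact tool, which accepts several source databases followed by a destination. The following is a rough Python sketch under stated assumptions: xapian-compact is on PATH, the smaller databases have already been built, and the shard paths and helper names are hypothetical.

```python
import subprocess


def compact_command(shard_paths, dest_path):
    """Build the argument list: xapian-compact SOURCE... DESTINATION."""
    if not shard_paths:
        raise ValueError("need at least one source database")
    return ["xapian-compact", *shard_paths, dest_path]


def merge_shards(shard_paths, dest_path):
    """Merge the smaller databases into one (requires xapian-compact)."""
    subprocess.run(compact_command(shard_paths, dest_path), check=True)


if __name__ == "__main__":
    # Hypothetical shard names; each would have been indexed separately.
    shards = ["shard1.db", "shard2.db", "shard3.db"]
    print(" ".join(compact_command(shards, "merged.db")))
```

Printing the command first rather than running it makes it easy to check the paths before committing to a long merge.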

Cheers,
    Olly


