Amount of writes during index creation

Jean-Francois Dockes jf at dockes.org
Mon Feb 4 18:00:10 GMT 2019


Olly Betts writes:
 > On Thu, Jan 31, 2019 at 08:44:44PM +0100, Jean-Francois Dockes wrote:
 > > I have run a number of tests, with data mostly from a project
 > > gutenberg dvd and other books, with relatively modest index sizes,
 > > from 1 to 24 GB.
 > > 
 > > Quite curiously, in this zone, with all Xapian versions I tried, the
 > > amount of writes is roughly proportional to the index size to the
 > > power 1.5:
 > > 
 > > TotalWrites / (IndexSize**1.5) ~= K
 > 
 > I could perhaps believe it would tend to O(n*log(n)) eventually due to
 > the number of levels in the B-tree being log(n) (though the number of
 > levels is bounded above by a fairly small constant so one could
 > argue that's O(n)).
 > 
 > But probably the merging on commit will actually determine the O()
 > behaviour, and that's harder to determine theoretically.

The 1.5 exponent is indeed frankly bizarre, but it holds rather well for
index sizes from 1.5 to 24 GB in this configuration... Just a curiosity.

 > > size        writes       K      writes/size
 > >
 > > 1402524     1597352      0.96   1.14
 > > 2223076     3291588      0.99   1.48
 > > 2678404     4121024      0.94   1.54
 > > 3842372     7219404      0.96   1.88
 > > 4964132     10850844     0.98   2.19
 > > 6062204     14751196     0.99   2.43
 > > 19677680    125418760    1.44   6.37
 > > 24349248    166162068    1.38   6.82
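
As a rough cross-check of the figures above, the exponent can be estimated with a least-squares fit on the logarithms of the two columns. This is only a sketch: the numbers are copied from the table, the helper names are made up, and the scaling of K (1000 * writes / size**1.5) is inferred from the listed values rather than stated in the original message.

```python
import math

# (index size, total writes) pairs copied from the table above.
ROWS = [
    (1402524, 1597352),
    (2223076, 3291588),
    (2678404, 4121024),
    (3842372, 7219404),
    (4964132, 10850844),
    (6062204, 14751196),
    (19677680, 125418760),
    (24349248, 166162068),
]


def k_value(size, writes):
    """K as it appears in the table: assumed to be 1000 * writes / size**1.5."""
    return 1000.0 * writes / size ** 1.5


def fit_exponent(rows):
    """Least-squares slope of log(writes) against log(size)."""
    xs = [math.log(size) for size, _ in rows]
    ys = [math.log(writes) for _, writes in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    for size, writes in ROWS:
        print(size, writes, round(k_value(size, writes), 2),
              round(writes / size, 2))
    print("fitted exponent, all rows:", round(fit_exponent(ROWS), 2))
    print("fitted exponent, first six rows:", round(fit_exponent(ROWS[:6]), 2))
```

Fitting the first six rows separately from the full set is a simple way to see whether the two largest indexes, where K jumps, still follow the same law.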

 > > The amount of writes is estimated with iostat before/after. The disk has
 > > nothing else to do.
 > 
 > There's a script in git which allows more precise I/O analysis by
 > logging relevant I/O using strace:
 > 
 > xapian-maintainer-tools/profiling/strace-analyse
 > 
 > Using strace means other processes are definitely excluded and you get
 > to see which tables (and even which blocks) the I/O is, e.g. a small
 > update to a small database gives:
 > 
 > [...]


I tried to use strace -c, but for some reason, the pwrite counts in the
results were erratic (sometimes getting something like 11 writes after
indexing), probably some issue with my script, so I did not use them.

The output was to a backup disk, with no other activity during the tests.
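
For reference, a before/after measurement of this kind can also be approximated by reading the kernel counters directly instead of going through iostat. The sketch below assumes Linux, where the tenth field of each /proc/diskstats line is the number of 512-byte sectors written to the device. The function names and the device name used in the usage note are made up for illustration, and this is not the script used for the tests above.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors


def sectors_written(diskstats_text, device):
    """Return the 'sectors written' counter for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Fields: major, minor, name, then the I/O counters;
        # index 9 is the number of sectors written.
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[9])
    raise KeyError(device)


def kb_written_between(before_text, after_text, device):
    """KB written to the device between two diskstats snapshots."""
    delta = (sectors_written(after_text, device)
             - sectors_written(before_text, device))
    return delta * SECTOR_BYTES // 1024


def read_diskstats(path="/proc/diskstats"):
    """Take a snapshot of the counters (Linux only)."""
    with open(path) as f:
        return f.read()
```

One would call read_diskstats() before and after the indexing run and pass both snapshots to kb_written_between() with the name of the output disk (for example "sdb"). As with iostat, this counts every write to the device, so the disk must have nothing else to do.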
 
 > If you're going to the trouble of profiling, probably best to use the
 > latest release (1.4.5 was released in 2017).

I was trying an older release to see if something had changed for the worse
recently.

 > > xapian git master latest idxflushmb 200				
 > > xapian git master before patch idxflushmb 200				

 > There are other changes between RELEASE/1.4 and master which will
 > likely improve indexing speed and memory use, but I'm not sure
 > there's anything which would affect disk writes (unless we end up
 > swapping to disk with 1.4 but master avoids doing so due to lower memory
 > usage).

Oops, sorry, the lines above should have read RELEASE/1.4, not master. Only
the later test with a small flush interval was done with master (by mistake).

Definitely no swapping to this disk.

 > > The improvement brought by the patch is nice. It remains that for
 > > people using big indexes on SSD, the amount of writes is still
 > > something to consider, and splitting the index probably makes sense?
 > > What do you think?
 > 
 > If you want to build a very large DB it's almost certain to be faster to
 > build it as a series of smaller DBs and merge them.

Thanks for the confirmation. This is what the reporting user has concluded,
and I'll confirm to them that it is the right approach.
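
For what it is worth, the merge step of such a split-and-merge build could be scripted along these lines. This is only a sketch: it assumes the partial indexes have already been built in separate directories (the paths below are made up), and it relies on the xapian-compact tool, which accepts several source databases followed by a destination.

```python
import subprocess


def compact_command(shard_dirs, dest_dir):
    """Build the xapian-compact command line: sources first, destination last."""
    if not shard_dirs:
        raise ValueError("need at least one source database")
    return ["xapian-compact", *shard_dirs, dest_dir]


def merge_shards(shard_dirs, dest_dir):
    """Run xapian-compact to merge the partial indexes into dest_dir."""
    subprocess.run(compact_command(shard_dirs, dest_dir), check=True)


if __name__ == "__main__":
    # Hypothetical layout: three partial indexes merged into one.
    parts = ["idx/part1", "idx/part2", "idx/part3"]
    print(" ".join(compact_command(parts, "idx/merged")))
```

How many partial indexes to use, and how big each should be, would have to be determined by experiment.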

 > At least with the current backends (glass and older) - the plan for the
 > next backend (honey) is that it'll actually behave like that behind the
 > scenes, but that part isn't fully written yet.

I am sure that people with big indexes will appreciate it!

Cheers,

jf


