Amount of writes during index creation

Jean-Francois Dockes jf at dockes.org
Thu Jan 31 19:44:44 GMT 2019


Olly Betts writes:
 > On Mon, Jan 21, 2019 at 03:25:01PM +0100, Jean-Francois Dockes wrote:
 > > I have had a problem report from a Recoll user about the amount of writes
 > > during index creation.
 > > 
 > > https://opensourceprojects.eu/p/recoll1/tickets/67/
 > > 
 > > The issue is that the index is on SSD and that the amount of writes is
 > > significant compared to the SSD life expectancy (index size > 250 GB).
 > > 
 > > From the numbers he supplied, it seems to me that the total amount of block
 > > writes is roughly quadratic with the index size.
 > > 
 > > First question: is this expected, or is Recoll doing something wrong ?
 > 
 > It isn't expected.
 > 
 > I think this is probably due to a bug which coincidentally was
 > discovered earlier this week by Germán M. Bravo.  I've now fixed it
 > and backported ready for 1.4.10.  If you're able to test to confirm
 > if this solves your problem that would be very useful - see
 > f19bcb96857419469f74f748e7fe8eaccaedc0fd on the RELEASE/1.4 branch:
 > 
 > https://git.xapian.org/?p=xapian;a=commitdiff;h=f19bcb96857419469f74f748e7fe8eaccaedc0fd
 > 
 > Anything which uses a term for a unique document identifier is likely to
 > be affected.
 > 
 > Cheers,
 >     Olly

I have run a number of tests, with data mostly from a Project Gutenberg DVD
and other books, with relatively modest index sizes, from 1 to 24 GB.

Quite curiously, in this range, with all Xapian versions I tried, the total
amount of writes is roughly proportional to the index size to the power
1.5:

TotalWrites / (IndexSize**1.5) ~= K

So, not quadratic, which is good news. For big indexes, 1.5 is not so good
but probably somewhat expected.
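
As a minimal sketch of how K can be computed from one measurement, assuming
both quantities are in KB as in the table further down (the function name
write_constant is only illustrative):

```python
def write_constant(index_kb, writes_kb, exponent=1.5):
    """Return K = TotalWrites / IndexSize**exponent for one indexing run."""
    return writes_kb / (index_kb ** exponent)

# First xapian 1.4.5 row of the table: 1544724 KB index, 6941286 KB written.
k = write_constant(1544724, 6941286)
print(f"K*1000 = {k * 1000:.2f}")
```

The table reports K multiplied by 1000 to keep the numbers readable.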

The other good news is that the patch above decreases the amount of writing
by a significant factor, around 4.5 for the biggest index I tried.

The amount of writes is estimated from iostat readings taken before and
after each indexing run. The disk has nothing else to do during the run.
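
The exact iostat invocation is not given here. As a rough equivalent, and
only as a sketch, the same before/after difference can be taken from the
Linux kernel counters in /proc/diskstats. The function names and the device
name "sda" below are assumptions for illustration:

```python
def sectors_written(diskstats_text, device):
    """Return the cumulative 'sectors written' counter for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major, minor, device name, then the I/O counters.
        # Sectors written is the 10th field (index 9).
        if len(fields) > 9 and fields[2] == device:
            return int(fields[9])
    raise ValueError(f"device {device!r} not found")


def kb_written(before_text, after_text, device):
    """KB written to the device between two /proc/diskstats snapshots."""
    delta = sectors_written(after_text, device) - sectors_written(before_text, device)
    # /proc/diskstats counts in 512-byte sectors: two sectors per KB.
    return delta * 512 // 1024


def read_diskstats(path="/proc/diskstats"):
    """Take one snapshot (Linux only)."""
    with open(path) as f:
        return f.read()
```

Usage would be: take one snapshot with read_diskstats(), run the indexer,
take a second snapshot, then call kb_written(before, after, "sda"). This
only measures the index writes if nothing else is using the disk.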

idxflushmb is the number of megabytes of input text between Xapian commits.

K*1000 is 1000 * writes / (index size ** 1.5), and w/sz is writes divided by
index size.

xapiandb,kb     writes,kb   K*1000    w/sz

xapian 1.4.5 idxflushmb 200

    1544724       6941286     3.62    4.49
    3080540      16312960     3.02    5.30
    4606060      21054756     2.13    4.57
    6123140      33914344     2.24    5.54
    7631788      50452348     2.39    6.61

xapian git master latest idxflushmb 200

    1402524       1597352     0.96    1.14
    2223076       3291588     0.99    1.48
    2678404       4121024     0.94    1.54
    3842372       7219404     0.96    1.88
    4964132      10850844     0.98    2.19
    6062204      14751196     0.99    2.43
   19677680     125418760     1.44    6.37

xapian git master before patch idxflushmb 200

   24707840     750228444     6.11   30.36

So that was 750 GB of writes for the big index before the patch...

As you can see my beautiful law does not hold so well for the biggest index :)
(K = 1.44)
It's not quite the same data though, so I would need more tests, but I
think I'll stop here...
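
For anyone who wants to check the exponent rather than assume 1.5, here is
a small sketch of a least-squares fit on the log-log values. It uses the
git master rows of the table up to the 6 GB index. The function name
fit_exponent is only illustrative:

```python
import math


def fit_exponent(sizes_kb, writes_kb):
    """Slope of log(writes) against log(size), by ordinary least squares."""
    xs = [math.log(s) for s in sizes_kb]
    ys = [math.log(w) for w in writes_kb]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# git master latest, idxflushmb 200, rows up to the 6 GB index
sizes = [1402524, 2223076, 2678404, 3842372, 4964132, 6062204]
writes = [1597352, 3291588, 4121024, 7219404, 10850844, 14751196]

exponent = fit_exponent(sizes, writes)
print(f"fitted exponent: {exponent:.2f}")
```

Adding the 19 GB row to both lists shows how much the biggest index pulls
the fit away from the smaller ones.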

The improvement brought by the patch is nice. Still, for people using big
indexes on SSD, the amount of writes remains something to consider, and
splitting the index probably makes sense. What do you think?
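
To put a number on the splitting idea, here is a back-of-the-envelope
sketch. It assumes that the power 1.5 law above holds for each sub-index
with the same K, and that the sub-indexes together are as big as the single
index would be. Neither assumption has been tested here, and the function
name is only illustrative:

```python
def split_write_ratio(n_parts, exponent=1.5):
    """Total writes for n equal sub-indexes, relative to one big index."""
    # One index of size S costs K * S**exponent.
    # Each of the n parts costs K * (S / n)**exponent, so the total is
    # n * (1 / n)**exponent times the single-index cost.
    return n_parts * (1.0 / n_parts) ** exponent


for n in (1, 2, 4, 16):
    print(f"{n:2d} parts: {split_write_ratio(n):.2f} x the single-index writes")
```

Under these assumptions the total shrinks like 1 / sqrt(n), so four
sub-indexes would need about half the writes of a single index.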

I'll run another test tonight with a smaller flush interval to see if it
changes things.

Cheers,

jf


