[Xapian-discuss] merge speed and multiple local DBs design

Ron Kass ron at pidgintech.com
Tue Oct 16 11:17:29 BST 2007


I have just finished a test of merging 3 databases and measuring the speed 
and efficiency of the operation. Here are the results:


-----

time xapian-compact /fts/FTS_1_part1 /fts/FTS_1_part2 /fts/FTS_1_part3 /fts/FTS_1_FINAL

postlist ...postlist: Reduced by 49.6601% 18151496K (36551448K -> 18399952K)
record: INCREASED by 2.22356% 15392K (692224K -> 707616K)
termlist: Reduced by 3.10786% 483896K (15570064K -> 15086168K)
position ...position: INCREASED by 2.56846% 880176K (34268672K -> 35148848K)
value: Reduced by 49.828% 2684256K (5387040K -> 2702784K)
spelling: Size unchanged (0K)
synonym: Size unchanged (0K)

real    312m13.114s
user    88m11.783s
sys     7m16.003s


-----

40G     FTS_1_part1

26G     FTS_1_part2
23G     FTS_1_part3
69G     FTS_1_FINAL


-----

Test machine specs:

CPU: quad-core Intel Xeon

HDD: 500GB SATA2 WD

Mem: 16GB 667MHz


-----

Overall documents in the merged test database: 32 million



-----

Overall size of the databases after merging dropped from 89G to 69G (a 22.5% reduction)

Speed, though, is another matter: the process took a bit more than 5 hours 
and used a moderate amount of CPU. More importantly, it used a significant 
amount of I/O.


In this case, shrinking the DB size is not a goal. If anything, the noted 
fact that changes to a compacted database will be slower than to the 
original uncompacted one (due to decreased space reserved for updates) is 
considered a drawback. But I assume for now that this speed difference is 
not major (certainly not in the longer run?).


However, considering the load and resources the merge used, and the length 
of time it took, it appears to be impractical to do daily, certainly once 
the database grows further.

The idea here was to use a daily DB for faster changes/indexing and then 
to merge it every night into a bigger one that contains all the data.
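For reference, the nightly merge described above might look like this (a sketch with hypothetical paths, not the ones from our test; shown as a dry run that only prints the commands):

```shell
# Hypothetical nightly merge job. xapian-compact takes one or more source
# databases followed by a destination; here the small daily DB and the big
# main DB are merged into a fresh tree, which is then swapped into place.
DAILY=/fts/FTS_daily        # small DB taking the day's updates
MAIN=/fts/FTS_main          # large merged DB used for searching
NEW=/fts/FTS_main.new       # destination for the compacted result

# Dry run: print the commands rather than executing them.
echo "xapian-compact $DAILY $MAIN $NEW"
echo "mv $MAIN $MAIN.old && mv $NEW $MAIN"
```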

I think it is safe to assume that the time it takes to merge the 
databases is roughly linear in the total size of the databases. In this 
case, about 90GB for 32M docs.

If that took 5 hours, merging 100M docs (about 300GB) will probably take 
about 17 hours.
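The extrapolation above, assuming merge time scales linearly with total input size (89G of input merged in about 312 minutes of wall-clock time, per the timings earlier):

```shell
# Linear extrapolation from the measured run: 89G in 312 minutes.
# Estimated wall-clock time for ~300G of input:
awk 'BEGIN { printf "%.1f hours\n", (312 / 60) * (300 / 89) }'
# prints: 17.5 hours
```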

Which basically rules out daily merges.

Also, keep in mind that we dedicated all of the machine's CPU and I/O to 
the merge. If we wanted to run regular indexing at the same time, plus 
heavy searching on the same node, and given that we want to allocate 
considerably fewer resources per 100M docs, we are probably not going to 
be able to support the daily merge model.


Any thoughts/suggestions about the above?

Is our test indicative of typical merge performance? Are we doing 
something wrong? Is anything incorrect in my assumptions?


Best regards,

Ron


