[Xapian-discuss] weak populated b-trees?

Arjen van der Meijden acmmailing at tweakers.net
Sun Sep 7 10:00:45 BST 2008


Hi Markus,

I have no answers for most of your questions, but can confirm your 
savings are quite large. Our largest database contains 1.276.595 
documents and the total database size is about 25GB. The total 
plain-text size of the corpus is about 15-16GB. Our compacted database 
is about 16GB in size.

Our database gets only one update run per day, however, and the number
of documents added and replaced per day varies between 150-300 and
300-600 respectively. We normally have no, or close to zero, deletes.

So far fewer changes than you have, but our index is now 6 months old.

Our strategy is to keep a working copy for 'fast' updates and a
compacted version for faster retrieval. We haven't actually done any
recent benchmarks, but it's supposed to be slightly faster this way both
in terms of indexing and retrieval.

Our compaction ratio is much less dramatic than yours; in fact only our
postlist sees large savings:

table        working  compacted  saved
position.DB  11G      11G        3,92%
postlist.DB  9.7G     4.5G       54,17%
record.DB    229M     220M       3,78%
termlist.DB  3.2G     3.0G       8,32%
value.DB     91M      46M        50,19%
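
For reference, the percentage xapian-compact reports is simply
(before - after) / before. A small Python sketch, using the postlist and
value figures from Markus's output quoted below. The implied_fill()
helper is only an illustration of mine: it assumes the compacted table
is packed close to 100% full and ignores per-block overhead, so read it
as a rough estimate rather than something Xapian reports.

```python
def reduction_pct(before_k, after_k):
    """Percentage saved when a table shrinks from before_k to after_k (in K)."""
    return 100.0 * (before_k - after_k) / before_k


def implied_fill(before_k, after_k):
    """Rough average block fill before compaction.

    Assumption (not from Xapian): the compacted table is ~100% full,
    so the old fill level is approximately after / before.
    """
    return after_k / before_k


# Sizes in K, taken from the xapian-compact output quoted in this thread.
tables = {
    "postlist": (3179480, 734840),
    "value": (488840, 91576),
}

for name, (before_k, after_k) in tables.items():
    print(f"{name}: reduced by {reduction_pct(before_k, after_k):.3f}%, "
          f"implied fill before compaction about "
          f"{implied_fill(before_k, after_k):.0%}")
```

Under that assumption, a 75% average fill would give only about a 25%
reduction, which is why reductions of 55-80% point at blocks that were
far emptier than that.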

I don't really understand your objection to running xapian-compact. The
main disadvantage of compaction is that your first few update runs will
have to do a relatively large number of (extra) block splits. But then
again, you may actually gain indexing performance in your case, for the
same reason that retrieval performance should increase quite a bit.

For retrieval speed, most gains come from reducing the amount of I/O.
This is done in two ways: by being smart about which blocks to read, and
by simply having fewer blocks to read in total.
In your case you can dramatically increase the file system's cache-hit
ratio if you have less than 16GB of memory in your server. And even if
you have 16GB or more, there are simply far fewer blocks to be read for
every single query, so you should still win.
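
As a back-of-the-envelope illustration of the cache argument: the
fraction of the index that can stay in the file system cache is capped
at RAM size divided by index size. The sketch below uses your 12 GB and
3.8 GB index sizes; the 8 GB of cache is a purely hypothetical figure,
and real hit ratios depend on access patterns, so this is an upper bound
on what fits, not a prediction.

```python
def cacheable_fraction(index_gb, cache_gb):
    """Upper bound on the fraction of the index that fits in the page cache."""
    return min(1.0, cache_gb / index_gb)


CACHE_GB = 8.0  # hypothetical amount of memory available for caching

for label, index_gb in (("uncompacted", 12.0), ("compacted", 3.8)):
    fraction = cacheable_fraction(index_gb, CACHE_GB)
    print(f"{label}: at most {fraction:.0%} of the index fits in cache")
```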

If you have to offer a continuously updating database to your users, I'd
only run xapian-compact after you've made some major change to the
database. And if by 'all documents got rebuilt about 5 times' you mean
that you could also have started from scratch, built a new index with
all the documents and thrown away the old one once you were done, I
would do that too.
If you only update periodically, you could try to keep two copies of
your database: a 'working version', and a 'retrieval version' which is
just a compacted copy of the working version. And in your case you could
also decide to replace the working copy with the retrieval copy once in
a while.
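
A minimal sketch of that two-copy routine in Python. The directory
names and the ".new"/".old" suffixes are placeholders of mine, and it
assumes no update run is writing to the working copy while the
compaction runs. The only Xapian-specific part is the xapian-compact
call, which takes the source database followed by the destination.

```python
import os
import shutil
import subprocess


def compact_to_retrieval(working, retrieval, run=subprocess.check_call):
    """Compact `working` into a fresh directory, then swap it in as `retrieval`.

    `run` executes the command; it is a parameter only so the function
    can be exercised without xapian-compact installed.
    """
    tmp = retrieval + ".new"   # hypothetical scratch location
    old = retrieval + ".old"   # hypothetical parking spot for the old copy

    # Clear leftovers from an earlier, interrupted run.
    for stale in (tmp, old):
        shutil.rmtree(stale, ignore_errors=True)

    # xapian-compact SOURCE DESTINATION
    run(["xapian-compact", working, tmp])

    # Swap the new copy in; the old one is kept until the rename is done.
    if os.path.isdir(retrieval):
        os.rename(retrieval, old)
    os.rename(tmp, retrieval)
    shutil.rmtree(old, ignore_errors=True)
```

You would call it as compact_to_retrieval("/path/to/working",
"/path/to/retrieval") after the daily update run. The swap is two
renames, so it is not atomic, and search processes that already have the
old copy open will need to reopen the database afterwards.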

Best regards,

Arjen

On 5-9-2008 20:14 Markus Wörle wrote:
> Hi,
> 
> I just ran xapian-compact on an index which consumes about 12 GB of
> disk space, containing 858.383 documents with an average doclength of
> 169.018, and was surprised by a huge compaction factor which I hadn't
> expected. After compaction, the index needed only 3.8 GB on disk.
> 
> My expectation was that it would only shrink by about 25% or so,
> because I expected the average fill of the b-tree blocks to be about
> 75%.
> 
> This is what xapian-compact said:
> 
> postlist: Reduced by 76.888% 2444640K (3179480K -> 734840K)
> record: Reduced by 65.4923% 1446352K (2208432K -> 762080K)
> termlist: Reduced by 67.2607% 1110312K (1650760K -> 540448K)
> position: Reduced by 56.6145% 2342160K (4137032K -> 1794872K)
> value: Reduced by 81.2667% 397264K (488840K -> 91576K)
> spelling: Size unchanged (0K)
> synonym: Size unchanged (0K)
> 
> My index's brief history:
> 
> The index was once built from scratch with add_document(), and got
> updated by a large number of replace_document_by_term(),
> add_document(), and delete_document_by_term() calls over a longer
> period (about 2 months or so). Some numbers: about 1 million
> modifications per day, of which about 4000 are document adds and 3000
> are removes. Additionally, in this 2-month period, all documents got
> rebuilt about 5 times by using replace_document_by_term() on a unique
> term for each document.
> 
> So my question is: Is this reasonable? Or rather, do you have any
> idea why my b-trees are so empty? Does Xapian merge weakly populated
> blocks again?
> 
> I am currently planning to stop indexing once a day to run
> xapian-compact, but I am uncertain whether this would "denaturate" the
> system. I have many modifications, and although "best indexing
> performance" is not really a concern in my use case, I feel somehow bad
> about manipulating a naturally balanced b-tree in a non-changing
> environment. What do you suggest?
> 
> Thanks,
> mrks