[Xapian-discuss] Re: My new record: Indexing 20 millions docs = 79m9.378s

Olly Betts olly at survex.com
Mon Feb 12 10:19:35 GMT 2007


Kevin Duraj <kevin.softdev at gmail.com> writes:
> - Yes, I did read that XAPIAN_FLUSH_THRESHOLD_LENGTH no longer has any
> effect and was removed; I was just not sure. It was a good decision, because
> I was getting confused about how to balance the number of records against
> the maximum memory used.

I think the threshold should really be the amount of memory used to
buffer posting list changes, but that isn't easy to measure as things
currently stand.  Ideally the threshold would also tune itself automatically
for good performance by default.  We'll get there eventually...
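
To make the idea concrete, here is a minimal sketch of a buffer that flushes
on an estimated byte count rather than on a document count.  It is not
Xapian's code: the class PendingPostings, its methods, and the size estimate
are all hypothetical, and the estimate ignores container overhead, which is
exactly the part that is hard to know accurately.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical buffer of pending posting list changes (not Xapian's code).
class PendingPostings {
  public:
    explicit PendingPostings(std::size_t max_bytes) : max_bytes_(max_bytes) {}

    // Record that `term` occurs in document `docid`.
    // Returns true if this addition pushed the estimate to the threshold
    // and so triggered a flush.
    bool add(const std::string& term, std::uint32_t docid) {
        auto it = postings_.find(term);
        if (it == postings_.end()) {
            it = postings_.emplace(term, std::vector<std::uint32_t>()).first;
            bytes_ += term.size();  // count each new term's text once
        }
        it->second.push_back(docid);
        bytes_ += sizeof(std::uint32_t);  // crude: payload only, no overhead
        if (bytes_ < max_bytes_) return false;
        flush();
        return true;
    }

    // A real backend would merge the changes into its on-disk tables here;
    // this sketch only resets the buffer and counts the flush.
    void flush() {
        postings_.clear();
        bytes_ = 0;
        ++flushes_;
    }

    std::size_t bytes() const { return bytes_; }
    unsigned flushes() const { return flushes_; }

  private:
    std::size_t max_bytes_;
    std::size_t bytes_ = 0;
    unsigned flushes_ = 0;
    std::map<std::string, std::vector<std::uint32_t>> postings_;
};
```

With a 16-byte threshold, adding ("cat", 1) and ("cat", 2) leaves the
estimate at 11 bytes, and adding ("dog", 3) takes it to 18 and triggers a
flush.  Documents with many terms fill the buffer sooner than short ones,
which a fixed document count cannot account for.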

> - I am building two prototypes to compare performance between Lucene .NET on
> Windows and Xapian on Linux. For my prototype I am therefore simply using
> scriptindex (/usr/local/bin/scriptindex --stemmer=none /home/kevin/index1
> indexscript1 $filename) to index 20 million records. If Xapian performs
> better than Lucene, then I will write a new search using C/C++ and will
> use WritableDatabase::add_document() ... Thank you for the suggestion.

Unless you use the "unique" action, scriptindex will just call add_document
anyway, so that's fine as it is.

It might be interesting to see what speedup this patch gives you:

http://oligarchy.co.uk/xapian/patches/xapian-faster-flint-add-document.patch

It implements more compact storage of the pending posting list changes made by
add_document with the flint backend.  Currently replace_document and
delete_document are disabled: it's just a quick prototype which I'm going to
test on a full rebuild of gmane's index, but the gmane machine is still
regenerating the index spool, so I won't be able to test it there for a few
more days.
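
As a rough illustration of what "more compact storage" could look like, the
sketch below keeps one term's pending docids as delta-encoded variable-length
integers in a single string.  This is an assumption made for illustration,
not the code in the patch: the names append_varint and CompactPostings are
hypothetical.  It relies on docids arriving in increasing order, which holds
when documents are only ever appended.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Append `value` as a variable-length integer: 7 bits per byte, low bits
// first, with the high bit set on every byte except the last.
void append_varint(std::string& out, std::uint32_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<char>((value & 0x7f) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<char>(value));
}

// Hypothetical compact pending list for a single term (not the patch's code).
class CompactPostings {
  public:
    // Docids must be added in strictly increasing order, so each delta is
    // small and positive.
    void add(std::uint32_t docid) {
        append_varint(data_, docid - last_);
        last_ = docid;
    }

    // Rebuild the full docid list by summing the decoded deltas.
    std::vector<std::uint32_t> decode() const {
        std::vector<std::uint32_t> docids;
        std::uint32_t current = 0, delta = 0;
        int shift = 0;
        for (unsigned char byte : data_) {
            delta |= static_cast<std::uint32_t>(byte & 0x7f) << shift;
            if (byte & 0x80) {  // more bytes of this delta follow
                shift += 7;
                continue;
            }
            current += delta;
            docids.push_back(current);
            delta = 0;
            shift = 0;
        }
        return docids;
    }

    std::size_t size_bytes() const { return data_.size(); }

  private:
    std::string data_;
    std::uint32_t last_ = 0;
};
```

Adding docids 1000000, 1000001 and 1000005 takes 5 bytes here (3 for the first
delta, 1 each for the next two) against 12 bytes for three raw 32-bit docids.
An append-only encoding like this cannot cheaply change or remove an earlier
entry, so a scheme of this kind would need extra work to support replacing or
deleting documents.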

If this looks promising, we can sort out a better version which doesn't disable
the other methods!

Cheers,
    Olly


