[Xapian-discuss] My new record: Indexing 20 million docs = 79m9.378s

Kevin Duraj kevin.softdev at gmail.com
Fri Feb 9 19:24:28 GMT 2007


Olly,

- Yes, I did read that XAPIAN_FLUSH_THRESHOLD_LENGTH no longer has any
effect and was removed; I was just not sure. It was a good decision, because
I was getting confused about how to balance the number of records against
the maximum memory used. If the bottleneck beyond 20 million records is the
CPU cache, that is great, because we will not be limited in the future as
servers improve.

- I am building two prototypes to compare the performance of Lucene/.NET on
Windows and Xapian on Linux. Therefore, for my prototype I am simply using
scriptindex (/usr/local/bin/scriptindex --stemmer=none /home/kevin/index1
indexscript1 $filename) to index the 20 million records. If Xapian performs
better than Lucene, I will write the new version in C/C++ and call
WritableDatabase::add_document() directly (a rough sketch follows the field
list below) ... Thank you for the suggestion. The indexscript1 contains:

  Field1 : field
  Field2 : boolean=XU field
  Field3 : indexnopos
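
For that eventual C/C++ rewrite, here is a minimal sketch of what the
WritableDatabase::add_document() loop might look like. It assumes the
Xapian 1.0-era C++ API; the field values are placeholders, the database
path is taken from my scriptindex command, and only a single record is
shown:

  #include <xapian.h>

  #include <iostream>
  #include <string>

  int main() {
      try {
          Xapian::WritableDatabase db("/home/kevin/index1",
                                      Xapian::DB_CREATE_OR_OPEN);
          Xapian::TermGenerator indexer;  // no stemmer, as --stemmer=none

          // Placeholder values; real code would loop over the dump file.
          std::string field1 = "text of field one";
          std::string field2 = "somevalue";
          std::string field3 = "text of field three";

          Xapian::Document doc;
          indexer.set_document(doc);
          indexer.index_text(field1);      // Field1 : field
          doc.add_term("XU" + field2, 0);  // Field2 : boolean=XU (wdf 0) ...
          indexer.index_text(field2);      // ... plus the "field" action
          indexer.index_text_without_positions(field3);  // Field3 : indexnopos

          db.add_document(doc);
          db.flush();  // write the pending changes to disk
      } catch (const Xapian::Error &e) {
          std::cerr << e.get_description() << std::endl;
          return 1;
      }
      return 0;
  }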

I cannot split the 20 million records into subsets, because they are already
a subset of 300 million records that will be searched concurrently across 15
machines.

With small modifications I was able to shorten the time for Xapian to index
the 20 million records to 52 minutes on the (8 CPU / 32 GB) machine. I also
measured search performance with 50,000 unique Boolean searches using
randomly generated terms (between 100 and 1000). This time the new searches
did not fork new processes and did not open and close the index database (a
rough sketch of that pattern follows below). The average Xapian/Perl search
improved from 140-280 ms to 55 ms on the 20 million document index, which is
about 3x faster than when each search ran a separate Xapian Perl script.
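
In C++, that persistent search process would look roughly like the sketch
below. My actual searches are in Perl, so this is only an illustration; the
line-per-query loop over stdin and the page size of 10 are assumptions:

  #include <xapian.h>

  #include <iostream>
  #include <string>

  int main() {
      // Open the database once and reuse it for every query, instead of
      // forking a new script (and reopening the database) per search.
      Xapian::Database db("/home/kevin/index1");
      Xapian::Enquire enquire(db);
      Xapian::QueryParser qp;  // parses AND/OR/NOT Boolean queries
      qp.set_database(db);

      std::string line;
      while (std::getline(std::cin, line)) {  // one Boolean query per line
          enquire.set_query(qp.parse_query(line));
          Xapian::MSet matches = enquire.get_mset(0, 10);
          std::cout << matches.get_matches_estimated()
                    << " matches for: " << line << "\n";
      }
      return 0;
  }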

Now that I have the prototypes completed, I will run Xapian/Perl on Linux
and Lucene/C#.NET on Windows on two identical machines (4 CPU / 16 GB),
using the same data, to measure indexing speed and search speed with the
same Boolean search criteria, and see which is faster.

PS: Sometimes to know the truth you need to find it yourself
:-)
-Kevin Duraj


On 2/8/07, Olly Betts <olly at survex.com> wrote:
>
> On Wed, Feb 07, 2007 at 01:21:06PM -0800, Kevin Duraj wrote:
> > Gentoo Linux 2.6
> > 8 AMD Opteron 64-bit Processors
> > 32GB Memory
> >
> > Environment:
> > ------------------
> > XAPIAN_FLUSH_THRESHOLD=21000000
> > XAPIAN_FLUSH_THRESHOLD_LENGTH=16000000
>
> Setting XAPIAN_FLUSH_THRESHOLD_LENGTH no longer does anything (it was
> removed in September 2004).
>
> > PS: In my scenario, after 25 million records the indexing slows down
> > significantly (2x-4x); I do not know why. Could it be because the
> > B-Tree becomes very complex?
>
> That seems unlikely; the B-Tree complexity grows logarithmically.
>
> It's probably a cache effect - as the working set of a process grows,
> performance can suddenly get worse when it just fails to fit in the
> available CPU cache.  In your case, I suspect it's some key subset of
> the working set which is the issue.
>
> When indexing, do you only call WritableDatabase::add_document()?  If
> so, we should be able to index significantly faster than this by
> buffering appended changes in a more compact way.
>
> Cheers,
>     Olly
>

