[Xapian-discuss] XAPIAN_FLUSH_THRESHOLD

Thu Jun 14 19:25:12 BST 2007

Olly,

My goal is to index 500 million to 1 billion documents. I chose
batches of 10-30 million documents, a size that can be indexed in
approx. 1-2 hours, and plan to merge approx. 50 of those indexes into
two or three large indexes.
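
To make the plan concrete, here is a rough Python sketch of the
arithmetic, together with the xapian-compact command line for each
merge group. The shard size (20 million), the number of merge groups
(3) and the directory names (shard_000, merged_0, ...) are placeholders
made up for illustration. The only thing taken from xapian-compact
itself is the argument order: source databases first, destination last.
Nothing is executed; the commands are only built as argument lists.

```python
def plan_shards(total_docs, docs_per_shard):
    """Return how many shard databases are needed to hold total_docs."""
    # Integer ceiling division, so a partial shard still counts as one.
    return (total_docs + docs_per_shard - 1) // docs_per_shard


def merge_groups(shard_names, n_groups):
    """Split shard names into n_groups contiguous groups of near-equal size."""
    base, extra = divmod(len(shard_names), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        size = base + (1 if i < extra else 0)
        groups.append(shard_names[start:start + size])
        start += size
    return groups


def compact_command(sources, destination):
    """Build (but do not run) an xapian-compact argument list."""
    return ["xapian-compact", *sources, destination]


# Hypothetical numbers: 1 billion documents, 20 million per shard.
n_shards = plan_shards(1_000_000_000, 20_000_000)
shards = [f"shard_{i:03d}" for i in range(n_shards)]
groups = merge_groups(shards, 3)
commands = [compact_command(g, f"merged_{i}") for i, g in enumerate(groups)]

for cmd in commands:
    print(len(cmd) - 2, "shards ->", cmd[-1])
```

Each argument list could be handed to subprocess.run() once the shard
databases exist.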

I am investigating Xapian 1.xx behaviour, and I am seeing some
unpleasant performance problems. I am not sure what is going on with
the new version, but indexing appears to stop suddenly at around 8
million documents: the process stops using CPU and just sits there. It
could be related to my data, but I clean the data programmatically, and
almost the same data was indexed fine with Xapian 0.9.xxx.

Thank you for your quick answers.
-Kevin

On 6/11/07, Olly Betts <olly at survex.com> wrote:
> On Mon, Jun 11, 2007 at 02:42:33PM -0700, Kevin Duraj wrote:
> > I have been indexing with XAPIAN_FLUSH_THRESHOLD set to 10 million
> > documents for over 6 months, and it worked fine and fast until Xapian
> > version 1.0. It used to take 50 minutes to index 10 million documents.
> > After installing Xapian 1.0.0, 10 million documents now take approx.
> > 16 hours to index. I was looking for bugs in my code but saw that very
> > little memory was used even when the threshold was set to 10 million.
>
> This doesn't make sense to me.  Yes, the compression will use more CPU
> time, but it shouldn't make much difference to the process size.
>
> A lot of things changed in Xapian 1.0.0, not just compression.  So it
> would be useful if you could profile so we can actually see where the
> extra time is spent, rather than just guessing.
>
> If you're on Linux, the best tool seems to be oprofile, because it
> samples the whole system (kernel and userland).  I believe a 2.6
> kernel is needed for best results - just run your indexer under oprofile
> with the callgraph enabled (call opcontrol with --callgraph=12 or
> some suitable stack depth, then run opreport with --callgraph).
>
> This should show exactly where the extra time is being spent.
>
> > I have installed Xapian 1.0.1; it seems to be using more memory,
> > which is good.
>
> There weren't any relevant changes to flint between 1.0.0 and 1.0.1
> so it seems unlikely you'd see any real difference.
>
> > What might be large for you is small for others. I want to be
> > able to index 1 billion documents in a reasonable time.
>
> Incidentally, you'll probably get there fastest by building a number of
> smaller databases and merging them using xapian-compact.
>
> > Either Xapian 1.0 does not take the threshold into account, or the
> > compression that was introduced takes too much time. We need an
> > environment variable option to disable compression entirely.
>
> Perhaps.  More likely the thresholds at which to use the compression
> just need tuning, as I've said before.
>
> The most obvious value to play with is COMPRESS_MIN in
> backends/flint/flint_table.cc which is currently 4.  I've not tried
> experimenting with different values, so it would be interesting to
> see some real world benchmarks.
>
> > - I do not care how large the index is, or that compression reduces
> > its size.
> > - I care how much time it takes to index 10-100 million documents
> > per index.
>
> There is a connection between the two though.  Up to a point,
> compression will increase indexing speed, because disks are slow
> compared to CPUs and RAM.
>
> Cheers,
>    Olly
>
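
Olly's last point, that compression helps "up to a point", can be put
in rough numbers. The Python sketch below is a toy model assumed purely
for illustration, not a measurement of flint: writing a block costs the
time to compress it plus the time to write the compressed bytes, with
no overlap between CPU and disk. The throughput figures and the
compression ratio are invented.

```python
def write_time(n_bytes, disk_bps, compress_bps=None, ratio=1.0):
    """Seconds to write n_bytes under a toy serial model.

    disk_bps     -- disk write throughput in bytes/second
    compress_bps -- compressor throughput in bytes/second (None = off)
    ratio        -- uncompressed size divided by compressed size
    """
    if compress_bps is None:
        return n_bytes / disk_bps
    # CPU time to compress, then disk time for the smaller output.
    return n_bytes / compress_bps + (n_bytes / ratio) / disk_bps


def break_even_compress_bps(disk_bps, ratio):
    """Compressor throughput above which compression is a net win."""
    if ratio <= 1.0:
        raise ValueError("compression must shrink the data (ratio > 1)")
    return disk_bps * ratio / (ratio - 1.0)


MB = 1_000_000  # hypothetical units: megabytes per second below
plain = write_time(1000 * MB, disk_bps=50 * MB)
fast = write_time(1000 * MB, disk_bps=50 * MB, compress_bps=200 * MB, ratio=2.0)
slow = write_time(1000 * MB, disk_bps=50 * MB, compress_bps=40 * MB, ratio=2.0)
threshold = break_even_compress_bps(50 * MB, 2.0)

print("no compression:", plain, "s")
print("fast compressor:", fast, "s")
print("slow compressor:", slow, "s")
print("break-even compressor speed:", threshold / MB, "MB/s")
```

In this model a compressor faster than the break-even speed shortens
the total write time and a slower one lengthens it, which is why
profiling is needed to see which side of that line 1.0.0 is on.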

-- 
Cheers,
   Kevin