[Xapian-discuss] PHP Fatal error while indexing Wikipedia

Robert Young bubblenut at gmail.com
Thu Jan 3 01:32:16 GMT 2008


On Jan 3, 2008 1:08 AM, Olly Betts <olly at survex.com> wrote:
> The "delve" example program in xapian-core is a good way to look at
> what's actually been indexed.
Thanks, I'll take a look at it.
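From a quick look at its usage message, something like "delve -r 1
./wikidb" should list the terms indexing document 1, with ./wikidb
standing in for my database directory.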

> The best optimised (and probably inherently most optimisable) indexing
> is when you're appending large batches of documents to a database using
> add_document().
>
> If you're indexing from scratch and don't have duplicate UID terms in
> the data being indexed (which I assume is true for wikipedia dumps),
> then your replace_document() calls are equivalent to just appending
> with add_document() except that you keep looking up UID terms, which
> means a std::map look-up and then a B-tree lookup.  I don't know the
> overhead of this, but it could be fairly hefty even if the B-tree
> blocks required are all cached.  You could try having a "rebuild" mode
> where add_document() is called.  I'd be interested to hear how much of
> a difference this makes.
Well, it certainly makes a pretty big difference. It's pushed docs/sec
up to just under 30 (about 28-30), from fluctuating between 15 and 21
before. That puts it just a hair's breadth ahead of the same run with
Solr (about 28-29/sec). If you're interested, this is all working
towards a search abstraction layer for PHP. I'm not quite sure how best
to expose the rebuild mode in that interface, but it definitely seems
worth it, thanks.
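
In case it helps anyone else, the rebuild mode boils down to something
like the sketch below. It's rough, not my actual code: $path, $pages,
$rebuild and the Q-prefixed UID terms are placeholders from my setup.

<?php
include "xapian.php";

$db = new XapianWritableDatabase($path, Xapian::DB_CREATE_OR_OPEN);

foreach ($pages as $page) {
    $doc = new XapianDocument();
    $doc->set_data($page['title']);
    // Keep the UID term even in rebuild mode, so later incremental
    // runs can still find and replace the document.
    $doc->add_term('Q' . $page['id']);
    // ... index the article body into $doc here ...

    if ($rebuild) {
        // Fresh build, so no duplicate UIDs are possible: append
        // directly and skip the per-document UID term look-up.
        $db->add_document($doc);
    } else {
        // Incremental update: look up the UID term and replace.
        $db->replace_document('Q' . $page['id'], $doc);
    }
}

$db->flush();
?>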

Cheers
Rob


