[Xapian-discuss] PHP Fatal error while indexing Wikipedia
Olly Betts
olly at survex.com
Thu Jan 3 01:08:39 GMT 2008
On Wed, Jan 02, 2008 at 08:15:40PM +0000, James Aylett wrote:
> If you have a large number of unique terms being generated, you'll get
> a large database. There may be something to do with your term
> generation that's unexpected here - you can dump a list of terms with
> a little PHP script to find out what's going on, perhaps.
The "delve" example program in xapian-core is a good way to look at
what's actually been indexed.
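For example (options from memory - check the usage message delve
prints for the exact details):

    delve /path/to/db            # overall database statistics
    delve -r 42 /path/to/db      # list the terms indexing document 42
    delve -t wiki /path/to/db    # list the documents indexed by "wiki"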
> I don't actually know how replace_document works precisely when given
> a unique identifying term (which is what I assume you mean by UID).
It's pretty much what you'd probably guess - it looks up the posting
list for the UID term to get the document id, and if it finds one it
calls replace_document() with that document id (and delete_document()
for any other docids indexed by the UID term). If no document is
found, it calls add_document().
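Roughly, in terms of the public API, it amounts to something like
this (a simplified sketch rather than the actual implementation - the
helper name is made up):

    #include <xapian.h>
    #include <string>

    // Sketch of what replace_document(uid_term, doc) boils down to.
    Xapian::docid
    replace_by_uid(Xapian::WritableDatabase &db,
                   const std::string &uid_term,
                   const Xapian::Document &doc)
    {
        Xapian::PostingIterator p = db.postlist_begin(uid_term);
        if (p == db.postlist_end(uid_term)) {
            // UID term not indexed yet, so this is a new document.
            return db.add_document(doc);
        }
        Xapian::docid did = *p;
        db.replace_document(did, doc);
        // Delete any other documents also indexed by the UID term.
        while (++p != db.postlist_end(uid_term))
            db.delete_document(*p);
        return did;
    }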
There might be smarter ways to do this - it's not something anyone
has particularly tried to optimise yet.
The best-optimised (and probably inherently most optimisable) indexing
case is appending large batches of documents to a database using
add_document().
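For example, a bulk-load loop can just be (a sketch - the batch size
here is arbitrary, and xapian-core will also flush automatically
based on XAPIAN_FLUSH_THRESHOLD):

    #include <xapian.h>
    #include <vector>

    // Append a batch of pre-built documents, flushing periodically
    // so changes get written out in reasonably large chunks.
    void
    append_batch(Xapian::WritableDatabase &db,
                 const std::vector<Xapian::Document> &docs)
    {
        unsigned count = 0;
        for (std::vector<Xapian::Document>::const_iterator i = docs.begin();
             i != docs.end(); ++i) {
            db.add_document(*i);
            if (++count % 10000 == 0) db.flush();
        }
        db.flush();
    }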
If you're indexing from scratch and don't have duplicate UID terms in
the data being indexed (which I assume is true for Wikipedia dumps),
then your replace_document() calls are equivalent to just appending
with add_document(), except that you keep looking up UID terms, which
means a std::map look-up and then a B-tree look-up for each document.
I don't know what the overhead of this is, but it could be fairly
hefty even if the B-tree blocks required are all cached. You could
try having a "rebuild" mode where add_document() is called instead
(sketched below). I'd be interested to hear how much of a difference
this makes.
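Something along these lines, say - just a sketch, where "rebuild"
means "trust the input not to contain duplicate UIDs", and the
document is assumed to already contain uid_term as a term:

    #include <xapian.h>
    #include <string>

    // Add or update one article.  In "rebuild" mode we skip the UID
    // look-up and just append, which avoids the extra B-tree work.
    void
    index_article(Xapian::WritableDatabase &db,
                  const std::string &uid_term,
                  const Xapian::Document &doc,
                  bool rebuild)
    {
        if (rebuild) {
            db.add_document(doc);
        } else {
            db.replace_document(uid_term, doc);
        }
    }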
Cheers,
Olly