[Xapian-discuss] PHP Fatal error while indexing Wikipedia
Olly Betts
olly at survex.com
Thu Jan 3 01:08:39 GMT 2008
On Wed, Jan 02, 2008 at 08:15:40PM +0000, James Aylett wrote:
> If you have a large number of unique terms being generated, you'll get
> a large database. There may be something to do with your term
> generation that's unexpected here - you can dump a list of terms with
> a little PHP script to find out what's going on, perhaps.
The "delve" example program in xapian-core is a good way to look at
what's actually been indexed.
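For example (options from memory - check the usage message delve
prints for the exact details):

    delve /path/to/db            # overall database statistics
    delve -r 42 /path/to/db      # list the terms indexing document 42
    delve -t wiki /path/to/db    # list the documents indexed by "wiki"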
> I don't actually know how replace_document works precisely when given
> a unique identifying term (which is what I assume you mean by UID).
It's pretty much what you'd probably guess - it looks up the posting
list for the UID term to get the document id, and if it finds one it
calls replace_document() with that document id (and delete_document()
for any other docids indexed by the UID term). If no document is
found, it calls add_document().
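Roughly, in terms of the public API, it amounts to something like
this (a simplified sketch rather than the actual implementation - the
helper name is made up):

    #include <xapian.h>
    #include <string>

    // Sketch of what replace_document(uid_term, doc) boils down to.
    Xapian::docid
    replace_by_uid(Xapian::WritableDatabase &db,
                   const std::string &uid_term,
                   const Xapian::Document &doc)
    {
        Xapian::PostingIterator p = db.postlist_begin(uid_term);
        if (p == db.postlist_end(uid_term)) {
            // UID term not indexed yet, so this is a new document.
            return db.add_document(doc);
        }
        Xapian::docid did = *p;
        db.replace_document(did, doc);
        // Delete any other documents also indexed by the UID term.
        while (++p != db.postlist_end(uid_term))
            db.delete_document(*p);
        return did;
    }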
There might be smarter ways to do this - it's not something anyone
has particularly tried to optimise yet.
The best-optimised (and probably inherently most optimisable) indexing
case is appending large batches of documents to a database using
add_document().
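For example, a bulk-load loop can just be (a sketch - the batch size
here is arbitrary, and xapian-core will also flush automatically
based on XAPIAN_FLUSH_THRESHOLD):

    #include <xapian.h>
    #include <vector>

    // Append a batch of pre-built documents, flushing periodically
    // so changes get written out in reasonably large chunks.
    void
    append_batch(Xapian::WritableDatabase &db,
                 const std::vector<Xapian::Document> &docs)
    {
        unsigned count = 0;
        for (std::vector<Xapian::Document>::const_iterator i = docs.begin();
             i != docs.end(); ++i) {
            db.add_document(*i);
            if (++count % 10000 == 0) db.flush();
        }
        db.flush();
    }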
If you're indexing from scratch and don't have duplicate UID terms in
the data being indexed (which I assume is true for Wikipedia dumps),
then your replace_document() calls are equivalent to just appending
with add_document(), except that you keep looking up UID terms, which
means a std::map look-up and then a B-tree look-up for each document.
I don't know what the overhead of this is, but it could be fairly
hefty even if the B-tree blocks required are all cached. You could
try having a "rebuild" mode where add_document() is called instead
(sketched below). I'd be interested to hear how much of a difference
this makes.
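Something along these lines, say - just a sketch, where "rebuild"
means "trust the input not to contain duplicate UIDs", and the
document is assumed to already contain uid_term as a term:

    #include <xapian.h>
    #include <string>

    // Add or update one article.  In "rebuild" mode we skip the UID
    // look-up and just append, which avoids the extra B-tree work.
    void
    index_article(Xapian::WritableDatabase &db,
                  const std::string &uid_term,
                  const Xapian::Document &doc,
                  bool rebuild)
    {
        if (rebuild) {
            db.add_document(doc);
        } else {
            db.replace_document(uid_term, doc);
        }
    }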
Cheers,
Olly