[Xapian-discuss] PHP Fatal error while indexing Wikipedia
bubblenut at gmail.com
Wed Jan 2 19:13:09 GMT 2008
Excellent, thanks Olly, that seems to get rid of the fatal error,
however, things are still not quite right. First, I think a bit of
background may be helpful. As I mentioned before I'm indexing a
Wikipedia dump (approx 13Gb), I'm skipping all redirect entries which
cuts out quite a lot. For the entries which I am indexing (the
articles), I am indexing the page id, the title, and the article text,
then I am setting, as the document data, a serialized object containing
just the id and title. The Wikipedia dump is being processed in 2Gb
chunks due to limitations in PHP.
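For illustration, the per-document payload described above (terms for the id, title and text, plus a small serialized blob holding just the id and title) might look roughly like the sketch below. It's in Python rather than PHP for brevity, `make_payload` is a hypothetical helper, and the Xapian calls are shown as comments since the exact binding API isn't part of the original post:

```python
import json

def make_payload(page_id, title):
    """Build the small data blob stored with each document: just the id
    and title, not the article text (hypothetical helper, mirroring the
    serialized PHP object described above)."""
    return json.dumps({"id": page_id, "title": title})

# Indexing one article would then look roughly like this (Xapian calls
# shown as comments; the real binding API may differ):
#   doc = xapian.Document()
#   doc.set_data(make_payload(page_id, title))
#   indexer.set_document(doc)
#   indexer.index_text(title)
#   indexer.index_text(body)
#   doc.add_term("Q%d" % page_id)  # unique-id term, by Xapian convention
#   db.replace_document("Q%d" % page_id, doc)

payload = make_payload(12, "Example article")
print(payload)
```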
There are a number of things I'm noticing which I'm not sure are normal:
- The position and postlist files seem to be growing at a tremendous
rate. The indexer hasn't even got past the first 2.0Gb chunk and
already position.DB and postlist.DB are each over 1.2Gb. I have tried
to find out exactly what each of these files does but haven't had much
luck. A brief addition to each of the table pages on the wiki
explaining what the table actually stores would be really helpful.
- As the index gets bigger the disk gets hammered. Obviously this is
to be expected to an extent, but things are getting really bad: I'm
seeing 90-95% CPU time waiting on IO. I'm guessing this is partly due
to the fact that I'm doing this on my laptop with its crappy laptop
disk, and partly due to using replace_document, so that it has to do a
query on each update. Is there any way of optimizing queries for
looking up uids? Would an auxiliary index just for uid-to-docid
lookups help, so that I only need to call replace_document on
documents I know are in the index?
- Indexing performance really drops off as the index grows. It's not
great at any rate, as it's running on my laptop, but it's been running
for over 12 hours now and it still hasn't indexed the first 2Gb chunk.
I'm guessing this is related to the second point.
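On the auxiliary-lookup idea: one cheap way to skip the replace_document lookup for documents that cannot already be in the index is to keep a side record of uids seen so far, and only call replace_document for those. A minimal sketch, assuming the helper names below (the Xapian calls are placeholders in comments, not the real binding API):

```python
# Hypothetical side-index: remember which page ids have already been
# indexed, so replace_document (which must look the uid term up in the
# postlist) is only called when the document might actually exist.
seen_uids = set()

def upsert(db, page_id, doc):
    """Add or replace doc depending on whether page_id has been seen.
    db would be a xapian WritableDatabase; calls are placeholders."""
    uid_term = "Q%d" % page_id
    if page_id in seen_uids:
        # db.replace_document(uid_term, doc)  # needs the postlist lookup
        action = "replace"
    else:
        # db.add_document(doc)                # no lookup required
        seen_uids.add(page_id)
        action = "add"
    return action

print(upsert(None, 42, None))  # first time: add
print(upsert(None, 42, None))  # second time: replace
```

For a 13Gb dump the set could be swapped for an on-disk store (e.g. sqlite) if memory becomes a problem, or dropped entirely if each chunk is known to contain only new articles.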
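On the slowdown: Xapian batches up changes in memory and flushes them periodically, and the default batch size may be too small for a bulk load of this scale, so each flush rereads and rewrites ever-larger tables. Raising the flush threshold via the documented XAPIAN_FLUSH_THRESHOLD environment variable before starting the indexer may help; the value and script name below are assumptions:

```shell
# Flush every 50000 documents instead of the default (value is a guess;
# tune to available RAM). index_wikipedia.php is a hypothetical name
# for the indexing script.
XAPIAN_FLUSH_THRESHOLD=50000 php index_wikipedia.php
```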