[Xapian-discuss] PHP Fatal error while indexing Wikipedia

Thu Jan 3 00:24:21 GMT 2008

On Wed, Jan 02, 2008 at 11:53:58PM +0000, Robert Young wrote:

> Yes, this may be an issue, I'm getting a couple of strange things happen;
> - It doesn't look like the stemmer is doing anything, just as one
> example of many, surely woman and women should have the same stem?

Ideally, but actually the English stemmer doesn't do this at the
moment (in the most convenient online wordlist I have to hand, only
about one in six -men words should stem to the same as the equivalent
word ending -man). Try something like 'happiness', which should stem
to 'happi'.

> - How can I have 's removed from the end of terms?

How are you generating your terms? (Again, you may have mentioned this
already - sorry if so.) From later comments, I assume you're using
TermGenerator, probably directly from scriptindex or omindex. If this
is the case, then at the moment you can't: we're including single
apostrophes in terms because otherwise you end up with a lot of 'junk'
words (e.g. "didn't" => "didn" and "t", which isn't helpful). It also
enabled a change in the way searches containing an apostrophised word
were handled, which made them faster.

If you just want to strip them unconditionally, you probably need a
custom stemmer. It shouldn't be too hard, but you're letting yourself
in for more work by doing that.
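
As a rough illustration only: a cheaper route than a custom stemmer
might be to strip the possessive from the text before it ever reaches
TermGenerator. The sketch below is plain Python and assumes you control
the text that gets handed to the indexer; strip_possessives is a
made-up helper name, not anything Xapian provides.

```python
import re

# Hypothetical helper, not part of Xapian. Matches an apostrophe
# (ASCII ' or typographic U+2019) followed by "s" at the end of a word.
# Other apostrophised forms such as "didn't" are left alone.
_POSSESSIVE = re.compile(r"(?<=\w)['\u2019]s\b")


def strip_possessives(text):
    """Return text with trailing 's removed from each word."""
    return _POSSESSIVE.sub("", text)


if __name__ == "__main__":
    print(strip_possessives("Wikipedia's editors didn't agree"))
```

Note that this also turns contractions like "it's" into "it", which
may or may not be what you want.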

> - Wikipedia has lots of words in other languages (completely different
> character sets) is there a way of getting the indexer to ignore terms
> with characters outside a given range?

You'll probably need a custom indexer for this. You need to think
carefully about your index plan as well at this point - do you
genuinely want to just drop those words? Are they marked up correctly
in some way to indicate the source language (Wikipedia's output is
HTML, so they really should be, but I wouldn't be surprised if they
aren't)?
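
If you do decide to simply drop them, one possible shape for the
filtering step is sketched below in plain Python. It is a pre-filter
applied to the text before indexing, not a Xapian feature. The
function name and the default cut-off (U+024F, the end of the Latin
Extended-B block) are assumptions for the sake of the example, so pick
whatever range suits your data.

```python
def drop_out_of_range_words(text, max_codepoint=0x024F):
    """Drop whitespace-separated words containing letters above max_codepoint.

    Hypothetical helper, not part of Xapian. Only alphabetic characters
    are checked, so digits and punctuation never cause a word to be dropped.
    """
    kept = []
    for word in text.split():
        letters = [c for c in word if c.isalpha()]
        if all(ord(c) <= max_codepoint for c in letters):
            kept.append(word)
    return " ".join(kept)


if __name__ == "__main__":
    print(drop_out_of_range_words("Moscow (Москва) café 2008"))
```

This splits on whitespace only, so a word with a single out-of-range
letter is dropped whole, along with any punctuation attached to it.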

> - There are lots of things getting indexed which I would not have
> expected to be indexed such as numbers and number string combinations

The default term generator indexes a lot of things which have proven
useful in the past. We'd like to make it more flexible in the future,
so if you have a particular way you'd like it to work, let us know.

> - All terms which start with a letter seem to be duplicated in
> Z-prefixed terms with the same frequency as the unprefixed term,
> what's this for?

This is because of the way we do phrase matching (and some other
things). The Z-prefixed terms should be the stemmed variants. There's
more detail at <http://www.xapian.org/docs/termgenerator.html>.

> I've had a read of the rest of your comments and they are very
> interesting and informative. I'm not, however, going to take another
> look at the other problems and possible solutions until I've managed
> to reduce the number of terms being generated. Does that sound like a
> sensible order?

Yes, that sounds reasonable. It's generally a good idea to get the
results you want before trying to optimise :-)

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org