[Xapian-discuss] PHP Fatal error while indexing Wikipedia

Olly Betts olly at survex.com
Thu Jan 3 01:51:18 GMT 2008


On Thu, Jan 03, 2008 at 12:24:21AM +0000, James Aylett wrote:
> On Wed, Jan 02, 2008 at 11:53:58PM +0000, Robert Young wrote:
> 
> > Yes, this may be an issue. I'm seeing a couple of strange things:
> > - It doesn't look like the stemmer is doing anything. As just one
> > example of many, surely woman and women should have the same stem?
> 
> Ideally, but actually the English stemmer doesn't do this at the
> moment

Keep in mind that the purpose of using a stemmer in IR is to improve
retrieval results rather than to exhaustively conflate every set of words
with a common stem.  In particular, some irregular forms aren't handled
simply because there isn't much benefit in doing so.

As James points out, simply changing "*men" -> "*man" is problematic.
You can avoid "amen" and "omen" by looking at the consonant/vowel
distances which Snowball calculates, but words like "stamen" and "semen"
are indistinguishable from "women" by those measures.
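
To make that concrete, here's a rough Python sketch.  It is not Snowball's
actual code: cv_pattern() and naive_men_to_man() are hypothetical helpers,
the pattern is only a crude stand-in for the consonant/vowel measures, and
I'm assuming "y" always counts as a consonant.

```python
from itertools import groupby

VOWELS = set("aeiou")  # assumption: 'y' is always treated as a consonant


def cv_pattern(word):
    """Collapse a word into alternating runs of consonants (C) and vowels (V)."""
    classes = ("V" if ch in VOWELS else "C" for ch in word.lower())
    # groupby merges adjacent identical classes, so "st" becomes a single "C"
    return "".join(key for key, _ in groupby(classes))


def naive_men_to_man(word):
    """Hypothetical rule: rewrite a trailing 'men' as 'man'."""
    return word[:-3] + "man" if word.endswith("men") else word


for w in ["women", "amen", "omen", "stamen", "semen"]:
    print(w, cv_pattern(w), naive_men_to_man(w))
```

"amen" and "omen" come out as VCVC, so a rule could exclude them, but
"women", "stamen" and "semen" all collapse to CVCVC, and the naive rule
turns the latter two into "staman" and "seman".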

Anyway, the Snowball project is really the best place to go for stemming
issues - we just import the Snowball stemmer code from there:

http://snowball.tartarus.org/

> > - There are lots of things getting indexed which I would not have
> > expected to be indexed, such as numbers and number/string combinations
> 
> The default term generator indexes a lot of things which have proven
> useful in the past. We'd like to make it more flexible in the future,
> so if you have a particular way you'd like it to work, let us know.

The issue here is that if you don't create a term for something, it
can't be searched for by a user.  The default TermGenerator strategy
leans towards creating terms if we think someone might find them useful.
Terms containing numbers can be dates, product codes, or telephone
numbers (if I get a phone number I don't recognise on caller ID, I often
Google it!).

You can certainly implement a usable search for a particular application
while indexing many fewer terms though.  You just need to think about
what's important for your users.  For example, if you're implementing a
search of the Perl documentation, your users will want to search for
"$@" and "@_".
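
As a rough illustration of that trade-off, here's a toy Python tokeniser.
It is not Xapian's TermGenerator and doesn't use the Xapian API at all:
make_terms(), its keep_numeric and specials parameters, and the
PERL_SPECIALS pattern are all made up for this example.

```python
import re

# Hypothetical pattern for a few Perl special variables such as "$@" and "@_".
PERL_SPECIALS = re.compile(r"[$@%][_@!0]")
# Plain alphanumeric runs; punctuation is discarded.
WORD = re.compile(r"[A-Za-z0-9]+")


def make_terms(text, keep_numeric=True, specials=None):
    """Return a list of terms for text (a sketch, not a real indexer)."""
    terms = []
    if specials is not None:
        # Keep application-specific tokens whole, before they get split up.
        terms.extend(specials.findall(text))
    for token in WORD.findall(text):
        # Optionally drop any token that contains a digit.
        if not keep_numeric and any(ch.isdigit() for ch in token):
            continue
        terms.append(token.lower())
    return terms


print(make_terms("Call 5551234 about part AB12"))
print(make_terms("Call 5551234 about part AB12", keep_numeric=False))
print(make_terms("check $@ and @_", specials=PERL_SPECIALS))
```

With keep_numeric=False the phone number and the product code both
vanish from the index, so nobody can search for them; with the Perl
pattern supplied, "$@" and "@_" survive as terms instead of being
thrown away as punctuation.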

As James says, TermGenerator should be more flexible (it was a new class
in 1.0.0, so it's still quite young).

Cheers,
    Olly


