[Xapian-discuss] Stopword addition and stemming

Olly Betts olly at survex.com
Mon Nov 15 12:50:16 GMT 2010


On Mon, Nov 15, 2010 at 01:56:32PM +0200, goran kent wrote:
> On 11/15/10, Olly Betts <olly at survex.com> wrote:
> > It often does make sense to choose stopwords based on the vocabulary of
> > the text collection you are working with.  And "us" would probably be a
> > stopword in English anyway.
> >
> > But here bob.co.us is interpreted as a phrase, and stopwords are included
> > in phrases by the QueryParser.
> 
> Just to clarify, the search string was [bob_co.us], where "_" is a
> space, but I gather from what you're saying that this would be a term
> (bob) and a phrase ("co.us") type search anyway, correct?

Ah, sorry, I misread.

Yes, that will be a term and a phrase.
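
To make the term/phrase split concrete, here is a rough toy model in Python.
It is not Xapian's QueryParser, just a sketch of the behaviour described in
this thread: whitespace separates chunks, a chunk which punctuation splits
into several words becomes a phrase, and stopwords are only dropped when they
appear as bare terms. The names parse() and STOPWORDS and the stopword list
itself are made up for illustration.

```python
import re

# Hypothetical stopword list, for illustration only (not Xapian's).
STOPWORDS = {"us", "co", "the", "to", "be", "or", "not"}


def parse(query, stopwords=STOPWORDS):
    """Toy model of the term/phrase split discussed above.

    Chunks are separated by whitespace.  A chunk that punctuation splits
    into more than one word becomes a phrase; otherwise it is a bare term.
    """
    result = []
    for chunk in query.lower().split():
        words = [w for w in re.split(r"[^a-z0-9]+", chunk) if w]
        if len(words) > 1:
            # Phrase: every word is kept, stopwords included.
            result.append(("phrase", tuple(words)))
        elif words and words[0] not in stopwords:
            # Bare term: kept only if it is not a stopword.
            result.append(("term", words[0]))
    return result


print(parse("bob co.us"))   # a term plus a phrase
print(parse("bob.co.us"))   # a single three-word phrase
print(parse("bob us"))      # the bare stopword "us" is dropped
```

In this model [bob co.us] gives the term "bob" and the phrase "co us", while
a bare "us" is dropped as a stopword.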

> > In this case, I'm not sure you would want to ignore the ".co.us" part
> > anyway - "bob.co.us" probably has a meaning sufficiently distinct from
> > that of "bob" that you wouldn't want to conflate them.
> 
> See above.

Yes, "co.us" isn't very useful there.

But Xapian doesn't try to look for the phrase "co.us" as a stopword (possibly
it should, though I'm not sure exactly how it ought to work if it did).

Perhaps all stopword phrases should be treated specially, though such cases
aren't always useless - for example, "to be or not to be" is a quote from
Hamlet, and "the the" are a band.

> > Another option would be to treat '.' as a word character when between
> > two letters, and so tokenise bob.co.us as a single term, but that's not
> > supported by TermGenerator and QueryParser currently, so you'd have to
> > patch Xapian or tokenise documents and queries yourself.
> 
> ug, beyond me, I'm afraid.

Actually it's very simple to do - you just need to tweak check_infix() in
queryparser/queryparser.lemony and queryparser/termgenerator_internal.cc
by adding '.' to the first test.
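
If you'd rather tokenise yourself than patch, the idea looks roughly like
this in Python. This is only a sketch of the technique, not the actual
check_infix() code: is_infix(), tokenise() and the dot_is_infix flag are
hypothetical names, the set of infix characters is a simplified assumption,
and "word character" here means whatever str.isalnum() accepts.

```python
def is_infix(ch, dot_is_infix=False):
    """Stand-in for an infix test: characters allowed inside a term when
    they sit between two word characters.  The base set is an assumption."""
    return ch in "'&" or (dot_is_infix and ch == ".")


def tokenise(text, dot_is_infix=False):
    """Split text into lowercased terms, keeping infix characters that
    have a word character on both sides."""
    terms = []
    current = ""
    for i, ch in enumerate(text):
        if ch.isalnum():
            current += ch
        elif (current and is_infix(ch, dot_is_infix)
              and i + 1 < len(text) and text[i + 1].isalnum()):
            # A non-empty `current` always ends in a word character, so
            # this infix character is between two word characters.
            current += ch
        else:
            # Any other character ends the current term.
            if current:
                terms.append(current.lower())
            current = ""
    if current:
        terms.append(current.lower())
    return terms


print(tokenise("bob.co.us"))                     # three separate terms
print(tokenise("bob.co.us", dot_is_infix=True))  # one term
```

A trailing '.' (end of a sentence, say) still ends the term, since it has no
word character after it.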

On Mon, Nov 15, 2010 at 02:18:48PM +0200, goran kent wrote:
> Also meant to ask:  can I apply that patch to search-code only, or
> must it also go into the indexing code?

It's only active when searching.

Cheers,
    Olly


