[Xapian-discuss] Stopword addition and stemming

goran kent gorankent at gmail.com
Mon Nov 15 11:56:32 GMT 2010


Top-post apology to Kevin Duraj for hijacking his thread.  I have no
idea what the hell I did to cause it.  I want to blame it on overly
permissive stateless web stuff instead of my own stupidity, but...

On 11/15/10, Olly Betts <olly at survex.com> wrote:
> It often does make sense to choose stopwords based on the vocabulary of
> the text collection you are working with.  And "us" would probably be a
> stopword in English anyway.
>
> But here bob.co.us is interpreted as a phrase, and stopwords are included
> in phrases by the QueryParser.

Just to clarify, the search string was [bob_co.us], where "_" is a
space, but I gather from what you're saying that this would be treated
as a term (bob) plus a phrase ("co.us") anyway, correct?

> In this case, I'm not sure you would want to ignore the ".co.us" part
> anyway - "bob.co.us" probably has a meaning sufficiently distinct from
> that of "bob" that you wouldn't want to conflate them.

See above.

>
> If you aren't already using Xapian 1.2, phrase searching should be faster
> with the new default chert backend.

Using 1.2.3 trunk.

> The patch in this ticket can also make a huge difference to slow phrase
> cases:
>
> http://trac.xapian.org/ticket/394
>
> It really needs cleaning up and folding into trunk, but I've not had
> time to do so yet.  If you try it, feedback would be much appreciated.

Will give it a try and report back.

> Another option would be to treat '.' as a word character when between
> two letters, and so tokenise bob.co.us as a single term, but that's not
> supported by TermGenerator and QueryParser currently, so you'd have to
> patch Xapian or tokenise documents and queries yourself.

Ugh, beyond me, I'm afraid.

Thanks for the feedback.
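
For anyone following the thread, the "tokenise documents and queries
yourself" option might look roughly like the sketch below.  It is plain
Python and does not touch Xapian at all; the name tokenise() and the
regular expression are my own illustration, not anything TermGenerator
or QueryParser provides.  It assumes ASCII letters only, and the same
function would have to be applied to both the indexed text and the
query string for the terms to match.

```python
import re

# Hypothetical tokeniser: a '.' is kept inside a token only when it has
# a letter immediately before it and a letter immediately after it.
# ASCII-only for brevity; real text would need Unicode-aware classes.
TOKEN_RE = re.compile(
    r"[A-Za-z0-9]+(?:(?<=[A-Za-z])\.(?=[A-Za-z])[A-Za-z0-9]+)*"
)

def tokenise(text):
    """Return lower-cased tokens, keeping letter.letter runs together."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]

if __name__ == "__main__":
    print(tokenise("bob.co.us"))   # one token
    print(tokenise("bob co.us"))   # two tokens
```

With this rule "bob.co.us" stays a single term while "bob co.us" gives
"bob" and "co.us", and a '.' next to a digit or at the end of a
sentence still splits as usual.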


