[Xapian-discuss] Stopword addition and stemming
goran kent
gorankent at gmail.com
Mon Nov 15 11:56:32 GMT 2010
Top-post apology to Kevin Duraj for hijacking his thread. I have no
idea what the hell I did to cause it. I want to blame it on overly
permissive stateless web stuff instead of my own stupidity, but...
On 11/15/10, Olly Betts <olly at survex.com> wrote:
> It often does make sense to choose stopwords based on the vocabulary of
> the text collection you are working with. And "us" would probably be a
> stopword in English anyway.
>
> But here bob.co.us is interpreted as a phrase, and stopwords are included
> in phrases by the QueryParser.
Just to clarify, the search string was [bob_co.us], where "_" stands for
a space. I gather from what you're saying that this would still be parsed
as a term (bob) plus a phrase ("co.us"), correct?
> In this case, I'm not sure you would want to ignore the ".co.us" part
> anyway - "bob.co.us" probably has a meaning sufficiently distinct from
> that of "bob" that you wouldn't want to conflate them.
See above.
>
> If you aren't already using Xapian 1.2, phrase searching should be faster
> with the new default chert backend.
Using 1.2.3 trunk.
> The patch in this ticket can also make a huge difference to slow phrase
> cases:
>
> http://trac.xapian.org/ticket/394
> It really needs cleaning up and folding into trunk, but I've not had
> time to do so yet. If you try it, feedback would be much appreciated.
Will give it a try and report back.
> Another option would be to treat '.' as a word character when between
> two letters, and so tokenise bob.co.us as a single term, but that's not
> supported by TermGenerator and QueryParser currently, so you'd have to
> patch Xapian or tokenise documents and queries yourself.
Ugh, that's beyond me, I'm afraid.
Thanks for the feedback.
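
For anyone reading this in the archive: the "tokenise documents and queries
yourself" option quoted above could look roughly like the sketch below. It
is only an illustration in plain Python, not Xapian code. The tokenise()
function and its regular expression are hypothetical, and the resulting
terms would still have to be fed to Xapian by hand, for documents and for
queries alike. It treats '.' as part of a word only when it has a letter on
both sides.

```python
import re

# Hypothetical tokeniser (not part of Xapian). A token is a run of ASCII
# letters/digits, extended across a '.' only when that '.' has a letter
# immediately before it and immediately after it.
_TOKEN = re.compile(
    r"[A-Za-z0-9]+(?:(?<=[A-Za-z])\.(?=[A-Za-z])[A-Za-z0-9]+)*"
)


def tokenise(text):
    """Return the lowercased tokens found in text."""
    return [match.lower() for match in _TOKEN.findall(text)]


if __name__ == "__main__":
    print(tokenise("bob.co.us"))
    print(tokenise("bob co.us"))
    print(tokenise("Xapian 1.2.3 is out."))
```

With this rule "bob.co.us" stays one term and "bob co.us" becomes two
terms, while a '.' next to a digit or at the end of a sentence still splits.
The same function has to be applied at index time and at query time,
otherwise the terms will not match.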