[Xapian-discuss] ------ stemming

Olly Betts olly at survex.com
Fri Aug 25 00:32:35 BST 2006


On Mon, Aug 07, 2006 at 05:06:54PM +0200, Reini Urban wrote:
> Inspecting a real-life index gives me a lot of R strings of
> Rbla--------------------------------------------

This is intended to allow indexing of things like Cl- (a chloride ion),
but there's currently no sanity check on the number of minuses.  As
you indicate, there really ought to be one.

I'm actually somewhat doubtful that it's all that useful.  People
complain if they can't search for C++ and C#, but keeping any trailing
"-" as part of the term risks gluing hyphens onto words if there's
no space between a word and a following hyphen.
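
To make the trade-off concrete, here is a rough Python sketch of the kind
of sanity check I mean.  It is not the actual indexer code: the tokenize()
function, the MAX_SUFFIX cap of 3 and the regular expression are all made
up for illustration, and the real term-splitting rules differ in detail.

```python
import re

# Hypothetical cap on how many trailing '+', '#' or '-' a term may keep.
MAX_SUFFIX = 3

# A run of alphanumerics, optionally followed by a run of '+', '#' or '-'
# which is only kept when it is not directly followed by another word
# character (so the hyphen in "well-known" just separates two words).
_TERM_RE = re.compile(r"[A-Za-z0-9]+(?:[+#-]+(?![A-Za-z0-9+#-]))?")


def tokenize(text, max_suffix=MAX_SUFFIX):
    """Split text into lower-cased terms, keeping short trailing suffixes."""
    terms = []
    for match in _TERM_RE.finditer(text):
        token = match.group(0)
        word = token.rstrip("+#-")
        suffix = token[len(word):]
        if len(suffix) > max_suffix:
            # An over-long run is almost certainly a ruler or separator,
            # not part of the word, so drop it entirely.
            suffix = ""
        terms.append((word + suffix).lower())
    return terms
```

With this sketch, "Cl-", "C++" and "C#" survive as terms, "bla" followed
by forty minuses is indexed as plain "bla", and a line consisting only of
punctuation yields no terms at all.  It also shows the downside:
tokenize("foo- bar") still produces "foo-", which is the gluing problem
described above.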

> and a lot of ------------------- terms.

You shouldn't get terms consisting only of punctuation.  If you do,
that's definitely a bug.

> Shouldn't "----", "====" or "****" rather be stemmed to, let's say, 3
> chars?
> "---"

"=" and "*" are never included in terms, so they aren't an issue.

Cheers,
    Olly


