[Xapian-tickets] [Xapian] #446: TermGenerator: Strange handling of '+' within a word
Xapian
nobody at xapian.org
Fri Feb 12 01:38:58 GMT 2010
#446: TermGenerator: Strange handling of '+' within a word
-------------------------+--------------------------------------------------
 Reporter:  cworth       |       Owner:  olly
     Type:  defect       |      Status:  assigned
 Priority:  normal       |   Milestone:  1.0.18
Component:  Library API  |     Version:  1.1.3
 Severity:  normal       |    Keywords:
Blockedby:               |    Platform:  All
 Blocking:               |
-------------------------+--------------------------------------------------
Changes (by olly):
* status: new => assigned
* component: Other => Library API
* milestone: => 1.0.18
Old description:
> I asked the TermGenerator to generate terms for a string containing
> " xapian+kanru ". I was surprised to see the result as the following
> two terms:
>
> xapian+
> kanru
>
> I did note that the documentation[1] of the term-generator says that
> "trailing +" is included on a term. But the handling of the above
> seems inconsistent. It appears that the embedded '+' is first treated
> as a non-word character to split the string into "xapian+" and "kanru"
> and then the '+' is identified as trailing, so is considered a
> word-character to yield "xapian+".
>
> I expected the embedded '+' to be treated consistently as a non-word
> character here, (it's not a trailing +), so the desired result would
> be the two terms "xapian" and "kanru".
>
> As always, thanks for Xapian!
>
> -Carl
>
> [1] http://xapian.org/docs/termgenerator.html
>
> PS. The above documentation has phrases like "a few other characters"
> in some places. I would love to see those replaced with lists of the
> actual characters so that I could predict correct results by reading
> the documentation.
New description:
 I asked the !TermGenerator to generate terms for a string containing
 " xapian+kanru ". I was surprised to see the result as the following
 two terms:

     xapian+
     kanru

 I did note that the documentation[1] of the term-generator says that
 "trailing +" is included on a term. But the handling of the above
 seems inconsistent. It appears that the embedded '+' is first treated
 as a non-word character to split the string into "xapian+" and "kanru"
 and then the '+' is identified as trailing, so is considered a
 word-character to yield "xapian+".

 I expected the embedded '+' to be treated consistently as a non-word
 character here, (it's not a trailing +), so the desired result would
 be the two terms "xapian" and "kanru".

 As always, thanks for Xapian!

 -Carl

 [1] http://xapian.org/docs/termgenerator.html

 PS. The above documentation has phrases like "a few other characters"
 in some places. I would love to see those replaced with lists of the
 actual characters so that I could predict correct results by reading
 the documentation.
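
The difference between the reported and the expected behaviour can be illustrated with a small stand-alone model. This is only a sketch and not Xapian's code: the function names terms_reported and terms_expected are hypothetical, a "word" is assumed to be a run of ASCII letters and digits, and '+' is assumed to be the only suffix character.

```python
import re

# Assumption: a word is a run of ASCII letters/digits. The real
# TermGenerator is Unicode-aware and has richer rules than this model.
WORD = re.compile(r"[A-Za-z0-9]+")


def _suffix_end(text, pos):
    """Return the index just past any run of '+' starting at pos."""
    while pos < len(text) and text[pos] == "+":
        pos += 1
    return pos


def terms_reported(text):
    """Model of the behaviour reported above: '+' after a word is always kept."""
    terms = []
    for m in WORD.finditer(text):
        end = _suffix_end(text, m.end())
        terms.append(text[m.start():end].lower())
    return terms


def terms_expected(text):
    """Model of the expected behaviour: keep '+' only when it is truly trailing."""
    terms = []
    for m in WORD.finditer(text):
        end = _suffix_end(text, m.end())
        # A '+' run directly followed by another word is embedded, so drop it.
        if WORD.match(text, end):
            end = m.end()
        terms.append(text[m.start():end].lower())
    return terms


print(terms_reported(" xapian+kanru "))  # ['xapian+', 'kanru']
print(terms_expected(" xapian+kanru "))  # ['xapian', 'kanru']
print(terms_expected("c++ rocks"))       # ['c++', 'rocks']
```

In this model a genuinely trailing suffix such as the one in "c++" is kept by both functions; only the embedded case differs.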
--
Comment:
!QueryParser already gets this right.
Fixed in trunk r13988.
 For 1.0, simply backporting this change arguably introduces an
 incompatibility in indexing. I'm not sure whether that matters, but
 perhaps we should index the first term both with and without the
 suffix in that case.
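
One possible reading of the compatibility idea in this comment is sketched below: when a '+' is embedded between two words, emit the first term both with and without the suffix, so that terms generated by older indexers still match. This is a hypothetical model, not the change made in r13988 and not Xapian's API; the name terms_compat is invented for illustration, and words are assumed to be runs of ASCII letters and digits.

```python
import re

# Assumption: a word is a run of ASCII letters/digits, and '+' is the
# only suffix character considered.
WORD = re.compile(r"[A-Za-z0-9]+")


def terms_compat(text):
    """Emit embedded-'+' terms both with and without the suffix."""
    terms = []
    for m in WORD.finditer(text):
        end = m.end()
        while end < len(text) and text[end] == "+":
            end += 1
        bare = text[m.start():m.end()].lower()
        with_suffix = text[m.start():end].lower()
        # The suffix is embedded if it exists and another word follows it.
        embedded = end > m.end() and WORD.match(text, end) is not None
        terms.append(with_suffix)
        if embedded:
            # Also add the bare form that the corrected behaviour would produce.
            terms.append(bare)
    return terms


print(terms_compat(" xapian+kanru "))  # ['xapian+', 'xapian', 'kanru']
print(terms_compat("c++ rocks"))       # ['c++', 'rocks']
```

The cost of this approach is one extra term per embedded suffix; whether that is worthwhile for a stable-branch backport is the open question raised above.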
--
Ticket URL: <http://trac.xapian.org/ticket/446#comment:1>
Xapian <http://xapian.org/>