[Xapian-tickets] [Xapian] #446: TermGenerator: Strange handling of '+' within a word

Xapian nobody at xapian.org
Fri Feb 12 01:38:58 GMT 2010


#446: TermGenerator: Strange handling of '+' within a word
-------------------------+--------------------------------------------------
 Reporter:  cworth       |       Owner:  olly    
     Type:  defect       |      Status:  assigned
 Priority:  normal       |   Milestone:  1.0.18  
Component:  Library API  |     Version:  1.1.3   
 Severity:  normal       |    Keywords:          
Blockedby:               |    Platform:  All     
 Blocking:               |  
-------------------------+--------------------------------------------------
Changes (by olly):

  * status:  new => assigned
  * component:  Other => Library API
  * milestone:  => 1.0.18


Old description:

> I asked the TermGenerator to generate terms for a string containing
> " xapian+kanru ". I was surprised to see the result as the following
> two terms:
>
>         xapian+
>         kanru
>
> I did note that the documentation[1] of the term-generator says that
> "trailing +" is included on a term. But the handling of the above
> seems inconsistent. It appears that the embedded '+' is first treated
> as a non-word character to split the string into "xapian+" and "kanru"
> and then the '+' is identified as trailing, so is considered a
> word-character to yield "xapian+".
>
> I expected the embedded '+' to be treated consistently as a non-word
> character here, (it's not a trailing +), so the desired result would
> be the two terms "xapian" and "kanru".
>
> As always, thanks for Xapian!
>
> -Carl
>
> [1] http://xapian.org/docs/termgenerator.html
>
> PS. The above documentation has phrases like "a few other characters"
> in some places. I would love to see those replaced with lists of the
> actual characters so that I could predict correct results by reading
> the documentation.

New description:

 I asked the TermGenerator to generate terms for a string containing
 " xapian+kanru ". I was surprised to see the result as the following
 two terms:

         xapian+
         kanru

 I did note that the documentation[1] of the term-generator says that
 "trailing +" is included on a term. But the handling of the above
 seems inconsistent. It appears that the embedded '+' is first treated
 as a non-word character to split the string into "xapian+" and "kanru"
 and then the '+' is identified as trailing, so is considered a
 word-character to yield "xapian+".

 I expected the embedded '+' to be treated consistently as a non-word
 character here (it's not a trailing '+'), so the desired result would
 be the two terms "xapian" and "kanru".

 As always, thanks for Xapian!

 -Carl

 [1] http://xapian.org/docs/termgenerator.html

 PS. The above documentation has phrases like "a few other characters"
 in some places. I would love to see those replaced with lists of the
 actual characters so that I could predict correct results by reading
 the documentation.
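
 The difference between the reported and the expected splitting can be
 modelled in a few lines. This is only a sketch in plain Python, not
 Xapian's implementation: the regular expressions and the function names
 `reported_terms` and `expected_terms` are assumptions made for
 illustration, and `\w` is a rough stand-in for whatever Xapian treats as
 a word character.

```python
import re

# Rough model of the reported output: any run of '+' directly after a
# word is kept, even when more word characters follow.
# (Assumption for illustration only; not taken from Xapian's source.)
REPORTED = re.compile(r"\w+\+*")

# Rough model of the expected output: a run of '+' is kept only when it
# is trailing, i.e. not directly followed by a word character.
EXPECTED = re.compile(r"\w+(?:\++(?![\w+]))?")


def reported_terms(text):
    """Terms as described in the report (hypothetical model)."""
    return [m.group(0).lower() for m in REPORTED.finditer(text)]


def expected_terms(text):
    """Terms as the reporter expected them (hypothetical model)."""
    return [m.group(0).lower() for m in EXPECTED.finditer(text)]


if __name__ == "__main__":
    print(reported_terms(" xapian+kanru "))
    print(expected_terms(" xapian+kanru "))
    print(expected_terms("C++ rocks"))
```

 Under this model an embedded '+' acts purely as a separator, while a
 genuinely trailing '+' (as in "C++") still stays attached to its term.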

--

Comment:

 QueryParser already gets this right.

 Fixed in trunk r13988.

 For 1.0, simply backporting this change arguably introduces an
 incompatibility in indexing.  I'm not sure whether that matters, but
 perhaps we should index the first term both with and without the '+'
 suffix there.
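
 The idea of indexing both forms could look roughly like the following.
 This is a hedged sketch in plain Python, not the change made in trunk
 and not a proposal for the actual 1.0 code: the function name
 `compat_terms` and the regular expression are hypothetical, and term
 positions are not modelled at all.

```python
import re

# A word followed by an optional run of '+'.
# (Hypothetical tokenizer for illustration; not Xapian's code.)
TOKEN = re.compile(r"(\w+)(\++)?")


def compat_terms(text):
    """Emit both the old and the new form for a word with an embedded '+'."""
    terms = []
    for m in TOKEN.finditer(text):
        word = m.group(1).lower()
        plus = m.group(2) or ""
        # The '+' run is "embedded" if a word character follows it directly.
        following = text[m.end():m.end() + 1]
        embedded = bool(plus) and re.match(r"\w", following) is not None
        if embedded:
            terms.append(word + plus)  # form produced by the old behaviour
            terms.append(word)         # form produced by the new behaviour
        else:
            terms.append(word + plus)  # plain word, or genuinely trailing '+'
    return terms


if __name__ == "__main__":
    print(compat_terms(" xapian+kanru "))
```

 With both forms present, a query built against either the old or the new
 term would still match documents indexed this way, at the cost of one
 extra term per embedded '+'.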

-- 
Ticket URL: <http://trac.xapian.org/ticket/446#comment:1>
Xapian <http://xapian.org/>