[Xapian-tickets] [Xapian] #22: Eliminate common cases which cause a slow phrase search

Tue Dec 16 14:55:26 GMT 2008

#22: Eliminate common cases which cause a slow phrase search
-------------------------+--------------------------------------------------
 Reporter:  olly         |        Owner:  olly     
     Type:  enhancement  |       Status:  assigned 
 Priority:  normal       |    Milestone:  1.1.0    
Component:  QueryParser  |      Version:  SVN trunk
 Severity:  minor        |   Resolution:           
 Keywords:               |    Blockedby:           
 Platform:  All          |     Blocking:           
-------------------------+--------------------------------------------------
Description changed by olly:

Old description:

> Some common punctuation (notably {{{-}}}) is treated as a word break when
> indexing, and as
> a phrase generator when searching.  The problem is that many common cases
> end up creating phrase searches with one or two character terms which
> are very common, and these search are slow with a big database.
>
> Examples include: {{{{e-mail}}} {{{cd-r}}} {{{d-i-y}}}
>
> This could perhaps be addressed by a smarter word identifying algorithm.
> When indexing and searching, we could decide never to generate a single
> character term in certain circumstances (maybe also apply the same rules
> for two character terms).
>
> So "e-mail" would be indexed as "email" not "e" and "mail".  And
> similarly for searching.  In general the extra conflation this gives
> seems useful (although email is apparently dutch for enamel...)
>
> The query parser probably wouldn't apply this rule to quoted phrase
> searches - otherwise searching for "o freddled gruntbuggly" would
> search for "ofreddled gruntbuggly" and tragically not find any matches
> (I'm sure there are less esoteric examples - a search for "i robot"
> say...)

New description:

 Some common punctuation (notably {{{-}}}) is treated as a word break when
 indexing, and as a phrase generator when searching.  The problem is that
 many common cases end up creating phrase searches containing one- or
 two-character terms which are very common, and these searches are slow
 with a big database.

 Examples include: {{{e-mail}}} {{{cd-r}}} {{{d-i-y}}}

 This could perhaps be addressed by a smarter word-identifying algorithm.
 When indexing and searching, we could decide never to generate a
 single-character term in certain circumstances (and maybe also apply the
 same rules to two-character terms).

 So "e-mail" would be indexed as "email", not as "e" and "mail", and
 similarly when searching.  In general the extra conflation this gives
 seems useful (although "email" is apparently Dutch for enamel...)

 The query parser probably wouldn't apply this rule to quoted phrase
 searches - otherwise searching for "o freddled gruntbuggly" would
 search for "ofreddled gruntbuggly" and tragically not find any matches
 (I'm sure there are less esoteric examples - a search for "i robot"
 say...)
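
A minimal sketch of the rule described above, in Python rather than Xapian's
own C++, and not taken from any Xapian code.  The names `split_on_hyphen`,
`merge_short_pieces`, `phrase_terms` and the `max_short_len` parameter are all
hypothetical.  It assumes the rule is "if any hyphen-separated piece is at
most `max_short_len` characters long, join all the pieces into one term", and
that explicitly quoted phrases skip the rule.

```python
def split_on_hyphen(word):
    """Split a word at '-' into lowercase pieces, dropping empty pieces."""
    return [piece for piece in word.lower().split("-") if piece]


def merge_short_pieces(pieces, max_short_len=1):
    """Join the pieces into a single term if any piece is very short.

    max_short_len is an assumed threshold: 1 merges only when a
    single-character piece is present, 2 would also cover
    two-character pieces.
    """
    if len(pieces) > 1 and any(len(p) <= max_short_len for p in pieces):
        return ["".join(pieces)]
    return list(pieces)


def phrase_terms(pieces, quoted, max_short_len=1):
    """Terms to search for; explicitly quoted phrases are left alone."""
    if quoted:
        return list(pieces)
    return merge_short_pieces(pieces, max_short_len)


if __name__ == "__main__":
    for word in ("e-mail", "cd-r", "d-i-y", "well-known"):
        print(word, merge_short_pieces(split_on_hyphen(word)))
    print(phrase_terms(["o", "freddled", "gruntbuggly"], quoted=True))
```

With the assumed threshold of 1, the three examples from the description each
collapse to a single term, "well-known" still produces two terms, and the
quoted phrase keeps its single-character term.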

-- 
Ticket URL: <http://trac.xapian.org/ticket/22#comment:22>
Xapian <http://xapian.org/>