[Xapian-tickets] [Xapian] #22: Eliminate common cases which cause a slow phrase search

Tue Aug 5 08:11:41 BST 2008

#22: Eliminate common cases which cause a slow phrase search
-------------------------+--------------------------------------------------
 Reporter:  olly         |        Owner:  olly    
     Type:  enhancement  |       Status:  assigned
 Priority:  normal       |    Milestone:  1.1.0   
Component:  QueryParser  |      Version:  SVN HEAD
 Severity:  minor        |   Resolution:          
 Keywords:               |    Blockedby:          
 Platform:  All          |     Blocking:          
-------------------------+--------------------------------------------------

Old description:

> Some common punctuation is treated as a word break when indexing, and as
> a phrase generator when searching.  The problem is that many common cases
> end up creating phrase searches with single or dual character terms which
> are very common, and these searches are slow with a big database.
>
> Examples include e-mail, cd-r, olly's.
>
> This could perhaps be addressed by a smarter word identifying algorithm.
> When indexing and searching, we could decide never to generate a single
> character term in certain circumstances (maybe also apply the same rules
> for 2 character terms).
>
> So "e-mail" would be indexed as "email" not "e" and "mail".  And
> similarly for searching.  In general the extra conflation this gives
> seems useful (although email is apparently Dutch for enamel...)
>
> The query parser probably wouldn't apply this rule to quoted phrase
> searches - otherwise searching for "o freddled gruntbuggly" would
> search for "ofreddled gruntbuggly" and tragically not find any matches
> (I'm sure there are less esoteric examples - a search for "i robot"
> say...)
>
> I'm not quite sure what to do about "1.0".  Perhaps numbers should
> be indexed specially as is?  So generate a term "1.0".

New description:

 Some common punctuation (notably {{{-}}}) is treated as a word break when
 indexing, and as a phrase generator when searching.  The problem is that
 many common cases end up creating phrase searches with one- or
 two-character terms which are very common, and these searches are slow
 with a big database.

 Examples include: {{{e-mail}}} {{{cd-r}}} {{{d-i-y}}}

 This could perhaps be addressed by a smarter word identifying algorithm.
 When indexing and searching, we could decide never to generate a single
 character term in certain circumstances (maybe also apply the same rules
 for two character terms).

 So "e-mail" would be indexed as "email" not "e" and "mail".  And
 similarly for searching.  In general the extra conflation this gives
 seems useful (although email is apparently Dutch for enamel...)
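
 As a rough illustration of the rule above (a sketch only, not Xapian's
 actual tokenizer): the helper name terms_for and the max_short_len
 parameter are hypothetical, only ASCII letters and digits are handled,
 and only {{{-}}} is treated as the joining punctuation.

```python
import re

# Hypothetical helper for illustration; not part of Xapian's API.
# Matches runs of ASCII alphanumerics joined by single hyphens.
_HYPHENATED = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def terms_for(text, max_short_len=1):
    """Split text into terms, conflating hyphenated words with a short part.

    If any hyphen-separated part is at most max_short_len characters long,
    the parts are joined into one term ("e-mail" -> "email") rather than
    emitted separately ("e", "mail").
    """
    terms = []
    for match in _HYPHENATED.finditer(text.lower()):
        parts = match.group(0).split("-")
        if len(parts) > 1 and any(len(p) <= max_short_len for p in parts):
            # A very short part would become a very common term: conflate.
            terms.append("".join(parts))
        else:
            # No short parts (or no hyphen at all): keep the usual word break.
            terms.extend(parts)
    return terms
```

 With the default max_short_len=1, terms_for("e-mail cd-r d-i-y") gives
 ['email', 'cdr', 'diy'], while terms_for("well-known") still gives
 ['well', 'known'].  The same function would have to be used when indexing
 and when parsing unquoted query text for the terms to match.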

 The query parser probably wouldn't apply this rule to quoted phrase
 searches - otherwise searching for "o freddled gruntbuggly" would
 search for "ofreddled gruntbuggly" and tragically not find any matches
 (I'm sure there are less esoteric examples - a search for "i robot"
 say...)

--

Comment(by olly):

 Update description.

-- 
Ticket URL: <http://trac.xapian.org/ticket/22#comment:20>
Xapian <http://xapian.org/>