[Xapian-tickets] [Xapian] #22: Eliminate common cases which cause a slow phrase search
Xapian
nobody at xapian.org
Tue Aug 5 08:11:41 BST 2008
#22: Eliminate common cases which cause a slow phrase search
-------------------------+--------------------------------------------------
Reporter: olly | Owner: olly
Type: enhancement | Status: assigned
Priority: normal | Milestone: 1.1.0
Component: QueryParser | Version: SVN HEAD
Severity: minor | Resolution:
Keywords: | Blockedby:
Platform: All | Blocking:
-------------------------+--------------------------------------------------
Old description:
> Some common punctuation is treated as a word break when indexing, and as
> a phrase generator when searching. The problem is that many common cases
> end up creating phrase searches with single or dual character terms which
> are very common, and these searches are slow with a big database.
>
> Examples include e-mail, cd-r, olly's.
>
> This could perhaps be addressed by a smarter word identifying algorithm.
> When indexing and searching, we could decide never to generate a single
> character term in certain circumstances (maybe also apply the same rules
> for 2 character terms).
>
> So "e-mail" would be indexed as "email" not "e" and "mail". And
> similarly for searching. In general the extra conflation this gives
> seems useful (although email is apparently Dutch for enamel...)
>
> The query parser probably wouldn't apply this rule to quoted phrase
> searches - otherwise searching for "o freddled gruntbuggly" would
> search for "ofreddled gruntbuggly" and tragically not find any matches
> (I'm sure there are less esoteric examples - a search for "i robot"
> say...)
>
> I'm not quite sure what to do about "1.0". Perhaps numbers should
> be indexed specially as is? So generate a term "1.0".
New description:
Some common punctuation (notably {{{-}}}) is treated as a word break when
indexing, and as a phrase generator when searching. The problem is that
many common cases end up creating phrase searches with one or two character
terms which are very common, and these searches are slow with a big database.

Examples include: {{{e-mail}}} {{{cd-r}}} {{{d-i-y}}}

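To make the problem concrete, here is a rough Python sketch of the current behaviour. It is a simplified model and not Xapian's actual tokenizer: it assumes any run of non-alphanumeric characters is a word break, and the helper names split_terms and short_terms are made up for illustration.

```python
import re

def split_terms(text):
    """Split on any run of non-alphanumeric characters (simplified model)."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def short_terms(text, max_len=2):
    """Return the terms of at most max_len characters that the split produces."""
    return [t for t in split_terms(text) if len(t) <= max_len]

for example in ("e-mail", "cd-r", "d-i-y"):
    # Each example turns into a phrase over these terms when searching.
    print(example, split_terms(example), short_terms(example))
```

Every example yields at least one term of one or two characters, and those are the terms likely to have very long posting lists in a big database.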
This could perhaps be addressed by a smarter word identifying algorithm.
When indexing and searching, we could decide never to generate a single
character term in certain circumstances (maybe also apply the same rules
for two character terms).

So "e-mail" would be indexed as "email", not as "e" and "mail", and
similarly for searching. In general the extra conflation this gives
seems useful (although "email" is apparently Dutch for enamel...)

The query parser probably wouldn't apply this rule to quoted phrase
searches - otherwise searching for "o freddled gruntbuggly" would
search for "ofreddled gruntbuggly" and tragically not find any matches
(I'm sure there are less esoteric examples - a search for "i robot"
say...)
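
One possible shape for the rule, as a minimal Python sketch rather than a proposed patch. The assumptions are mine: only {{{-}}} is considered, a hyphenated word is conflated whenever any of its parts has at most max_len characters, and quoted phrases are left alone. The names join_short, index_terms and query_terms are hypothetical and do not exist in Xapian.

```python
import re

# A word, optionally made of several hyphen-separated parts.
WORD = re.compile(r"[0-9a-z]+(?:-[0-9a-z]+)*")

def join_short(token, max_len=1):
    """Conflate a hyphenated word if any part has at most max_len characters."""
    parts = token.split("-")
    if any(len(p) <= max_len for p in parts):
        return ["".join(parts)]
    return parts

def index_terms(text):
    """Terms generated at index time under the joining rule."""
    terms = []
    for m in WORD.finditer(text.lower()):
        terms.extend(join_short(m.group()))
    return terms

def query_terms(query):
    """Parse a query; the joining rule is not applied inside quoted phrases."""
    out = []
    for m in re.finditer(r'"([^"]*)"|(\S+)', query.lower()):
        if m.group(1) is not None:
            # Quoted phrase: keep every word, however short.
            out.append(("phrase", re.findall(r"[0-9a-z]+", m.group(1))))
            continue
        for w in WORD.finditer(m.group(2)):
            parts = join_short(w.group())
            if len(parts) == 1:
                out.append(("term", parts[0]))
            else:
                # Longer hyphenated words still generate a phrase.
                out.append(("phrase", parts))
    return out

print(index_terms("e-mail cd-r d-i-y well-known"))
print(query_terms('e-mail "i robot"'))
```

With max_len=2 the same rule would also cover two character terms. Note that in this sketch a quoted "e-mail" is still split into "e" and "mail", which would not match documents indexed with the joining rule, so that case would need deciding separately.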
--
Comment(by olly):
Update description.
--
Ticket URL: <http://trac.xapian.org/ticket/22#comment:20>
Xapian <http://xapian.org/>