[Xapian-tickets] [Xapian] #22: Eliminate common cases which cause a slow phrase search
Xapian
nobody at xapian.org
Tue Dec 16 14:55:26 GMT 2008
#22: Eliminate common cases which cause a slow phrase search
-------------------------+--------------------------------------------------
Reporter: olly | Owner: olly
Type: enhancement | Status: assigned
Priority: normal | Milestone: 1.1.0
Component: QueryParser | Version: SVN trunk
Severity: minor | Resolution:
Keywords: | Blockedby:
Platform: All | Blocking:
-------------------------+--------------------------------------------------
Description changed by olly:
Old description:
> Some common punctuation (notably {{{-}}}) is treated as a word break when
> indexing, and as
> a phrase generator when searching. The problem is that many common cases
> end up creating phrase searches with one or two character terms which
> are very common, and these search are slow with a big database.
>
> Examples include: {{{{e-mail}}} {{{cd-r}}} {{{d-i-y}}}
>
> This could perhaps be addressed by a smarter word identifying algorithm.
> When indexing and searching, we could decide never to generate a single
> character term in certain circumstances (maybe also apply the same rules
> for two character terms).
>
> So "e-mail" would be indexed as "email" not "e" and "mail". And
> similarly for searching. In general the extra conflation this gives
> seems useful (although email is apparently dutch for enamel...)
>
> The query parser probably wouldn't apply this rule to quoted phrase
> searches - otherwise searching for "o freddled gruntbuggly" would
> search for "ofreddled gruntbuggly" and tragically not find any matches
> (I'm sure there are less esoteric examples - a search for "i robot"
> say...)
New description:
Some common punctuation (notably {{{-}}}) is treated as a word break when
indexing, and as a phrase generator when searching. The problem is that
many common cases end up creating phrase searches containing one- or
two-character terms which are very common, and these searches are slow
with a big database.
Examples include: {{{e-mail}}} {{{cd-r}}} {{{d-i-y}}}
This could perhaps be addressed by a smarter word-identification algorithm.
When indexing and searching, we could decide never to generate a
single-character term in certain circumstances (and maybe also apply the
same rules to two-character terms).
So "e-mail" would be indexed as "email", not as "e" and "mail", and
likewise when searching. In general the extra conflation this gives
seems useful (although "email" is apparently Dutch for enamel...)
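
As a rough illustration of the idea above (not Xapian's actual tokeniser),
here is a minimal Python sketch. The function name terms() and the
max_short parameter are hypothetical, and the rule chosen (merge a
hyphenated word whenever any of its parts is at most max_short characters
long) is only one assumed reading of "certain circumstances":

```python
import re

# A run of ASCII alphanumerics, optionally joined by single hyphens.
# (Assumption: real word-character rules would be Unicode-aware.)
_WORD = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def terms(text, max_short=1):
    """Yield lower-cased terms from text.

    A hyphenated word is emitted as one merged term if any of its parts
    has max_short characters or fewer; otherwise each part is emitted
    as a separate term.
    """
    for match in _WORD.finditer(text):
        parts = match.group(0).lower().split("-")
        if len(parts) > 1 and any(len(p) <= max_short for p in parts):
            yield "".join(parts)
        else:
            yield from parts
```

With the default max_short=1, "e-mail", "cd-r" and "d-i-y" each become a
single term, while "well-known" is still split. Passing max_short=2 would
extend the same rule to two-character parts such as the "cd" in "cd-rom".
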
The query parser probably wouldn't apply this rule to quoted phrase
searches - otherwise searching for "o freddled gruntbuggly" would
search for "ofreddled gruntbuggly" and tragically not find any matches
(I'm sure there are less esoteric examples - a search for "i robot"
say...)
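
The exemption for quoted phrases could look something like the following
Python sketch. parse_query() and its ("term", ...) / ("phrase", ...) return
shape are hypothetical and do not correspond to the QueryParser API. How
hyphens inside a quoted phrase should be handled is not stated here, so
this sketch simply splits on them without merging:

```python
import re

# Either a double-quoted phrase or a run of non-space characters.
_TOKEN = re.compile(r'"([^"]*)"|(\S+)')
# Anything that is not an ASCII letter or digit separates words.
_SPLIT = re.compile(r"[^a-z0-9]+")


def split_words(text):
    """Lower-case text and split it into alphanumeric words."""
    return [w for w in _SPLIT.split(text.lower()) if w]


def parse_query(query):
    """Return a list of ("term", str) and ("phrase", [str, ...]) items.

    Unquoted words containing a single-character part are merged into
    one term; the contents of quoted phrases are never merged.
    """
    items = []
    for quoted, bare in _TOKEN.findall(query):
        if bare:
            parts = split_words(bare)
            if len(parts) > 1 and any(len(p) == 1 for p in parts):
                parts = ["".join(parts)]
        else:
            parts = split_words(quoted)
        if len(parts) == 1:
            items.append(("term", parts[0]))
        elif parts:
            items.append(("phrase", parts))
    return items
```

Here an unquoted e-mail becomes the single term "email", whereas the quoted
phrase "i robot" keeps its single-character term and remains a two-term
phrase search.
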
--
Ticket URL: <http://trac.xapian.org/ticket/22#comment:22>
Xapian <http://xapian.org/>