[Xapian-tickets] [Xapian] #207: Add ability to accelerate wildcard queries for short terms
Xapian
nobody at xapian.org
Thu Sep 30 11:19:17 BST 2010
#207: Add ability to accelerate wildcard queries for short terms
-------------------------+--------------------------------------------------
Reporter: richard | Owner: richard
Type: defect | Status: new
Priority: normal | Milestone:
Component: QueryParser | Version: SVN trunk
Severity: normal | Resolution:
Keywords: | Blockedby:
Platform: All | Blocking:
-------------------------+--------------------------------------------------
Description changed by olly:
Old description:
> When doing a wildcard query (or a partial term query), it may be
> desirable to
> precompute the lists of documents for short query terms to avoid very
> slow
> searches. One strategy I've experimented with is indexing the first 1,
> 2, and 3
> characters of each term, marked by an I prefix, to so that 1, 2 or 3
> letter
> searches only need to access a single term.
>
> For example, "words" would be indexed as "Iw", "Iwo", "Iwor" and "words".
>
> The expansion would be done on unstemmed terms - if you try and apply it
> to
> stemmed words, all sorts of confusion occurs if the stem has a different
> first 3
> characters than the unstemmed form. Wildcards are currently handled by
> looking
> for unstemmed forms anyway, so I don't think this is a problem.
>
> Obviously, it might be sensible to use a different maximum prefix length
> than 3.
> Also, it may not be desirable to store all the prefixes: for example, if
> only
> the 3 letter prefixes were stored (rather than the 2 and 1 letter
> prefixes being
> stored as well) a search for "w*" could still be implemented more
> efficiently
> than before using the conjunction of all the 3 letter prefixes terms
> which begin
> with "Iw". However, there could still be a large number of these.
>
> To implement this, support needs to be added to the
> Term::as_partial_query and
> Term::as_wildcard_query methods in queryparser/queryparser.lemony. This
> doesn't
> necessarily need a query parser flag, since if the terms aren't present,
> the old
> behaviour can be used. However, it might be desirable to have a flag to
> turn
> the behaviour on to avoid imposing an overhead on wildcard searches in
> databases
> without the acceleration terms. Also, support for generating the terms
> needs to
> be added to the TermGenerator - this should be very easy, but will
> require a new
> configuration option.
New description:
When doing a wildcard query (or a partial term query), it may be desirable
to
precompute the lists of documents for short query terms to avoid very slow
searches. One strategy I've experimented with is indexing the first 1, 2,
and 3
characters of each term, marked by an I prefix, to so that 1, 2 or 3
letter
searches only need to access a single term.
For example, "words" would be indexed as "Iw", "Iwo", "Iwor" and "words".
The expansion would be done on unstemmed terms - if you try and apply it
to
stemmed words, all sorts of confusion occurs if the stem has a different
first 3
characters than the unstemmed form. Wildcards are currently handled by
looking
for unstemmed forms anyway, so I don't think this is a problem.
Obviously, it might be sensible to use a different maximum prefix length
than 3.
Also, it may not be desirable to store all the prefixes: for example, if
only
the 3 letter prefixes were stored (rather than the 2 and 1 letter prefixes
being
stored as well) a search for "w*" could still be implemented more
efficiently
than before using the conjunction of all the 3 letter prefixes terms which
begin
with "Iw". However, there could still be a large number of these.
To implement this, support needs to be added to the Term::as_partial_query
and
Term::as_wildcard_query methods in queryparser/queryparser.lemony. This
doesn't
necessarily need a query parser flag, since if the terms aren't present,
the old
behaviour can be used. However, it might be desirable to have a flag to
turn
the behaviour on to avoid imposing an overhead on wildcard searches in
databases
without the acceleration terms. Also, support for generating the terms
needs to
be added to the !TermGenerator - this should be very easy, but will
require a new
configuration option.
--
--
Ticket URL: <http://trac.xapian.org/ticket/207#comment:4>
Xapian <http://xapian.org/>
Xapian
More information about the Xapian-tickets
mailing list