[Xapian-tickets] [Xapian] #207: Add ability to accelerate wildcard queries for short terms

Thu Sep 30 11:19:17 BST 2010

#207: Add ability to accelerate wildcard queries for short terms
-------------------------+--------------------------------------------------
 Reporter:  richard      |        Owner:  richard  
     Type:  defect       |       Status:  new      
 Priority:  normal       |    Milestone:           
Component:  QueryParser  |      Version:  SVN trunk
 Severity:  normal       |   Resolution:           
 Keywords:               |    Blockedby:           
 Platform:  All          |     Blocking:           
-------------------------+--------------------------------------------------
Description changed by olly:

Old description:

> When doing a wildcard query (or a partial term query), it may be
> desirable to precompute the lists of documents for short query terms to
> avoid very slow searches.  One strategy I've experimented with is
> indexing the first 1, 2, and 3 characters of each term, marked by an I
> prefix, so that 1, 2 or 3 letter searches only need to access a single
> term.
>
> For example, "words" would be indexed as "Iw", "Iwo", "Iwor" and "words".
>
> The expansion would be done on unstemmed terms - if you try and apply it
> to stemmed words, all sorts of confusion occurs if the stem has a
> different first 3 characters than the unstemmed form.  Wildcards are
> currently handled by looking for unstemmed forms anyway, so I don't think
> this is a problem.
>
> Obviously, it might be sensible to use a different maximum prefix length
> than 3.  Also, it may not be desirable to store all the prefixes: for
> example, if only the 3 letter prefixes were stored (rather than the 2 and
> 1 letter prefixes being stored as well) a search for "w*" could still be
> implemented more efficiently than before using the disjunction (OR) of
> all the 3 letter prefix terms which begin with "Iw".  However, there
> could still be a large number of these.
>
> To implement this, support needs to be added to the
> Term::as_partial_query and Term::as_wildcard_query methods in
> queryparser/queryparser.lemony.  This doesn't necessarily need a query
> parser flag, since if the terms aren't present, the old behaviour can be
> used.  However, it might be desirable to have a flag to turn the
> behaviour on to avoid imposing an overhead on wildcard searches in
> databases without the acceleration terms.  Also, support for generating
> the terms needs to be added to the TermGenerator - this should be very
> easy, but will require a new configuration option.

New description:

 When doing a wildcard query (or a partial term query), it may be desirable
 to precompute the lists of documents for short query terms to avoid very
 slow searches.  One strategy I've experimented with is indexing the first
 1, 2, and 3 characters of each term, marked by an I prefix, so that 1, 2 or
 3 letter searches only need to access a single term.

 For example, "words" would be indexed as "Iw", "Iwo", "Iwor" and "words".
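
 A minimal sketch of the term generation described above, in plain Python
 rather than in Xapian's own code.  The names acceleration_terms and
 index_terms and the max_len parameter are hypothetical and are not part of
 any Xapian API; the "I" marker is the one proposed in this ticket.

```python
def acceleration_terms(term, max_len=3, marker="I"):
    """Return the marker-prefixed leading substrings of an unstemmed term.

    Hypothetical helper: lengths are counted in characters, and a term
    shorter than max_len simply yields fewer acceleration terms.
    """
    upper = min(max_len, len(term))
    return [marker + term[:n] for n in range(1, upper + 1)]


def index_terms(term, max_len=3):
    """Return every term to index for one word: acceleration terms, then the word."""
    return acceleration_terms(term, max_len) + [term]


if __name__ == "__main__":
    print(index_terms("words"))
```

 Raising max_len trades a larger index for cheaper lookups on longer
 wildcard stems.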

 The expansion would be done on unstemmed terms - if you try to apply it to
 stemmed words, all sorts of confusion occurs if the stem has a different
 first 3 characters than the unstemmed form.  Wildcards are currently
 handled by looking for unstemmed forms anyway, so I don't think this is a
 problem.

 Obviously, it might be sensible to use a different maximum prefix length
 than 3.  Also, it may not be desirable to store all the prefixes: for
 example, if only the 3 letter prefixes were stored (rather than the 2 and 1
 letter prefixes being stored as well) a search for "w*" could still be
 implemented more efficiently than before using the disjunction (OR) of all
 the 3 letter prefix terms which begin with "Iw".  However, there could
 still be a large number of these.
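
 If only the 3 letter prefixes are stored, "w*" has to be expanded over the
 acceleration terms rather than over every full term.  A rough sketch in
 plain Python, using a sorted list as a stand-in for the database's term
 list; expand_short_wildcard is a hypothetical name, not a Xapian function.
 The returned terms would be combined with OR, since a document matches if
 it contains any one of them.

```python
from bisect import bisect_left


def expand_short_wildcard(sorted_terms, stem, marker="I"):
    """Return the stored acceleration terms that begin with marker + stem.

    sorted_terms is assumed to be sorted in plain string order, which keeps
    all terms sharing a prefix in one contiguous run.
    """
    wanted = marker + stem
    start = bisect_left(sorted_terms, wanted)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(wanted):
            break  # past the end of the contiguous run
        matches.append(term)
    return matches


if __name__ == "__main__":
    terms = sorted(["Ithe", "Iwar", "Iwor", "Iwri", "the", "war", "words", "write"])
    print(expand_short_wildcard(terms, "w"))
```

 The result list is bounded by the number of distinct 3 letter prefixes
 under the stem, which can still be large for a single letter.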

 To implement this, support needs to be added to the
 Term::as_partial_query and Term::as_wildcard_query methods in
 queryparser/queryparser.lemony.  This doesn't necessarily need a query
 parser flag, since if the terms aren't present, the old behaviour can be
 used.  However, it might be desirable to have a flag to turn the behaviour
 on to avoid imposing an overhead on wildcard searches in databases without
 the acceleration terms.  Also, support for generating the terms needs to be
 added to the !TermGenerator - this should be very easy, but will require a
 new configuration option.
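
 The fallback logic could look roughly like the sketch below.  It is plain
 Python over a set of term strings, not the actual query parser code; the
 function wildcard_terms and the use_acceleration flag are hypothetical
 names standing in for whatever flag and methods would really be added.

```python
def wildcard_terms(all_terms, stem, use_acceleration=True, max_len=3, marker="I"):
    """Return the terms to OR together for the wildcard query stem*.

    Assumes stem is non-empty and that ordinary terms are lowercase, so
    they never collide with the upper-case marker.
    """
    if use_acceleration and len(stem) <= max_len:
        accel = marker + stem
        if accel in all_terms:
            # One precomputed term covers every word starting with stem.
            return [accel]
    # Old behaviour: enumerate every ordinary term starting with stem.
    return sorted(t for t in all_terms if t.startswith(stem))


if __name__ == "__main__":
    db = {"Iw", "Iwo", "Iwor", "words", "Iwa", "Iwar", "war"}
    print(wildcard_terms(db, "wo"))
    print(wildcard_terms(db, "wo", use_acceleration=False))
```

 With the flag off, the membership check is skipped entirely, which is the
 overhead a database without acceleration terms would otherwise pay.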

--

-- 
Ticket URL: <http://trac.xapian.org/ticket/207#comment:4>
Xapian <http://xapian.org/>