[Xapian-tickets] [Xapian] #215: Boolean OR could be optimised further

Xapian nobody at xapian.org
Thu Apr 24 03:13:48 BST 2008


#215: Boolean OR could be optimised further
-------------------------+--------------------------------------------------
 Reporter:  richard      |        Owner:  olly    
     Type:  enhancement  |       Status:  new     
 Priority:  normal       |    Milestone:          
Component:  Matcher      |      Version:  SVN HEAD
 Severity:  minor        |   Resolution:          
 Keywords:               |    Blockedby:          
 Platform:  All          |     Blocking:          
-------------------------+--------------------------------------------------
Changes (by olly):

  * owner:  newbugs => olly
  * status:  assigned => new


Old description:

> For a boolean OR of several terms, we don't care how many of the terms
> match -
> we just care whether any of them do.  Therefore, when we find that one of
> the
> terms matches, we shouldn't waste effort trying to move the other posting
> lists
> to the same position.
>
> Instead, we should hold the posting lists in order of termfrequency,
> highest
> termfrequency first.  To perform a skip_to() operation with ID "minid",
> we'd
> call skip_to(minid) on each of the sub-postlists until one of them moved
> to
> minid.  At that point, there would be no need to call skip_t() on the
> remaining
> sub-postlists.
>
> We'd need to keep track of which sub-postlists have been moved up to the
> current
> position, and which haven't.  When next() is called, we'd call next() on
> any
> sub-postlists which are up-to-date, but we would need to call skip_to()
> on any
> other sub-postlists which are further behind.
>
> We could implement this by introducing special BooleanOrPostingList to
> handle
> this particular case.
>
> I'm not sure that it would lead to much improvement for common cases, but
> it is
> easy to make up cases where it would make a huge improvement (eg, a
> boolean OR
> in which one of the terms is a MatchAll term - none of the other
> postlists would
> need to be moved in this case.)

New description:

 For a boolean OR of several terms, we don't care how many of the terms
 match -
 we just care whether any of them do.  Therefore, when we find that one of
 the
 terms matches, we shouldn't waste effort trying to move the other posting
 lists
 to the same position.

 Instead, we should hold the posting lists in order of termfrequency,
 highest
 termfrequency first.  To perform a skip_to() operation with ID "minid",
 we'd
 call skip_to(minid) on each of the sub-postlists until one of them moved
 to
 minid.  At that point, there would be no need to call skip_t() on the
 remaining
 sub-postlists.

 We'd need to keep track of which sub-postlists have been moved up to the
 current
 position, and which haven't.  When next() is called, we'd call next() on
 any
 sub-postlists which are up-to-date, but we would need to call skip_to() on
 any
 other sub-postlists which are further behind.

 We could implement this by introducing special BooleanOrPostingList to
 handle
 this particular case.

 I'm not sure that it would lead to much improvement for common cases, but
 it is
 easy to make up cases where it would make a huge improvement (eg, a
 boolean OR
 in which one of the terms is a MatchAll term - none of the other postlists
 would
 need to be moved in this case.)

--

Comment:

 Hmm, if weights are coming from an external source, then this becomes more
 interesting...

-- 
Ticket URL: <http://trac.xapian.org/ticket/215#comment:4>
Xapian <http://trac.xapian.org>
Xapian



More information about the Xapian-tickets mailing list