[Xapian-tickets] [Xapian] #215: Boolean OR could be optimised further
Xapian
nobody at xapian.org
Thu Apr 24 03:13:48 BST 2008
#215: Boolean OR could be optimised further
-------------------------+--------------------------------------------------
Reporter: richard | Owner: olly
Type: enhancement | Status: new
Priority: normal | Milestone:
Component: Matcher | Version: SVN HEAD
Severity: minor | Resolution:
Keywords: | Blockedby:
Platform: All | Blocking:
-------------------------+--------------------------------------------------
Changes (by olly):
* owner: newbugs => olly
* status: assigned => new
Old description:
> For a boolean OR of several terms, we don't care how many of the terms
> match -
> we just care whether any of them do. Therefore, when we find that one of
> the
> terms matches, we shouldn't waste effort trying to move the other posting
> lists
> to the same position.
>
> Instead, we should hold the posting lists in order of termfrequency,
> highest
> termfrequency first. To perform a skip_to() operation with ID "minid",
> we'd
> call skip_to(minid) on each of the sub-postlists until one of them moved
> to
> minid. At that point, there would be no need to call skip_t() on the
> remaining
> sub-postlists.
>
> We'd need to keep track of which sub-postlists have been moved up to the
> current
> position, and which haven't. When next() is called, we'd call next() on
> any
> sub-postlists which are up-to-date, but we would need to call skip_to()
> on any
> other sub-postlists which are further behind.
>
> We could implement this by introducing special BooleanOrPostingList to
> handle
> this particular case.
>
> I'm not sure that it would lead to much improvement for common cases, but
> it is
> easy to make up cases where it would make a huge improvement (eg, a
> boolean OR
> in which one of the terms is a MatchAll term - none of the other
> postlists would
> need to be moved in this case.)
New description:
For a boolean OR of several terms, we don't care how many of the terms
match -
we just care whether any of them do. Therefore, when we find that one of
the
terms matches, we shouldn't waste effort trying to move the other posting
lists
to the same position.
Instead, we should hold the posting lists in order of termfrequency,
highest
termfrequency first. To perform a skip_to() operation with ID "minid",
we'd
call skip_to(minid) on each of the sub-postlists until one of them moved
to
minid. At that point, there would be no need to call skip_t() on the
remaining
sub-postlists.
We'd need to keep track of which sub-postlists have been moved up to the
current
position, and which haven't. When next() is called, we'd call next() on
any
sub-postlists which are up-to-date, but we would need to call skip_to() on
any
other sub-postlists which are further behind.
We could implement this by introducing special BooleanOrPostingList to
handle
this particular case.
I'm not sure that it would lead to much improvement for common cases, but
it is
easy to make up cases where it would make a huge improvement (eg, a
boolean OR
in which one of the terms is a MatchAll term - none of the other postlists
would
need to be moved in this case.)
--
Comment:
Hmm, if weights are coming from an external source, then this becomes more
interesting...
--
Ticket URL: <http://trac.xapian.org/ticket/215#comment:4>
Xapian <http://trac.xapian.org>
Xapian
More information about the Xapian-tickets
mailing list