[Xapian-tickets] [Xapian] #433: MatchSpy should allow early termination
Xapian
nobody at xapian.org
Thu Apr 16 23:24:13 BST 2015
#433: MatchSpy should allow early termination
-------------------------+------------------------------
Reporter: richard | Owner: olly
Type: enhancement | Status: new
Priority: normal | Milestone: 1.3.x
Component: Matcher | Version: SVN trunk
Severity: normal | Resolution:
Keywords: | Blocked By:
Blocking: | Operating System: All
-------------------------+------------------------------
Description changed by olly:
Old description:
> I've recently come across some situations where it would be nice to have
> a !MatchSpy at the start of a search, but after the !MatchSpy has seen a
> certain set of values, it's easy to tell that the !MatchSpy won't benefit
> significantly from seeing any further values. This is particularly
> attractive in a database in which searches are being biased by a
> document-specific weight, such that the documents with a high weight are
> first, since the inaccuracy of early termination is often less important
> in this case. Three types of example:
>
> 1. A !MatchSpy which is checking for the existence of a particular flag
> in any of the matching documents. (eg, checking to see if any of the
> matching documents are products which are "special offers"). Once any
> such document has been seen, the !MatchSpy could be discarded.
> 1. A !MatchSpy which is checking for facets, but for which only
> approximate results and no counts are required, could terminate after
> seeing a specific maximum number of results (say, 1000). We could just
> set a checkatleast value of 1000 for the search, except that there may be
> other aspects of the search causing a higher checkatleast value to be
> required (for example, other facets for which a more exact count is
> required, or a !MatchSpy like that in the example above checking for
> special offers. Such a !MatchSpy needs to see all potential matches, but
> may terminate at any time.).
> 1. A !MatchSpy may only be interested in the documents with the top N
> weights. For example, when using a diversity reweighting algorithm, it's
> useful to know the categories of the top N weighted documents in each of
> several categories (this is similar to a collapse, but we want to ensure
> that if there are sufficient matching documents, each of the categories
> has a full complement of N matches found). Such a !MatchSpy could report
> the lowest weight of document that it's currently interested in to the
> matcher, potentially allowing the min_weight parameter passed to the
> posting list tree to be increased, rather than us simply passing
> checkatleast=doccount as is currently needed to get this guarantee.
>
> Currently, such matchspies can simply stop doing any work after they've
> seen enough information, but it would be more efficient if they could
> tell the matcher that they are finished, so that the matcher could
> terminate early.
>
> Several possible implementations spring to mind:
>
> a. Change !MatchSpy::operator() to return a bool; interpret a "true"
> value as "I'm finished". This would allow case 1 above to be handled
> nicely. Case 2 could also be handled in this way, but would require an
> appropriate checkatleast value to be set too.
> a. Change !MatchSpy::operator() to return a float, indicating the
> minimum weight that the !MatchSpy is now interested in. We could
> interpret a value of DBL_MAX as "I'm finished". This would support case
> 1 and 2 as before, but also support case 3. If we combine this with the
> min_weight value used by the existing matcher code, we could also remove
> the need to set a high checkatleast value for matchspies which need to
> see lots of documents - the matchspy would simply return 0.0 until it has
> seen enough documents, at which point it could return a high value
> instead.
> a. As for implementation (b), but in addition allow !MatchSpies to
> remove themselves from the matcher by a less hacky means than returning
> DBL_MAX. One approach could be to use a similar approach to
> !PostingSources (of registering the source with the matcher using a
> method on the base class); add a method on the base !MatchSpy class to
> tell the matcher that the matchspy can be removed. Actually, this
> approach could be used instead of a return value from
> !MatchSpy::operator() for setting a new minumum weight, too, but it seems
> likely to be less lightweight than using the return value.
>
> Marking for 1.1.7, since implementing these might require API changes.
New description:
I've recently come across some situations where it would be nice to have a
!MatchSpy at the start of a search, but after the !MatchSpy has seen a
certain set of values, it's easy to tell that the !MatchSpy won't benefit
significantly from seeing any further values. This is particularly
attractive in a database in which searches are being biased by a document-
specific weight, such that the documents with a high weight are first,
since the inaccuracy of early termination is often less important in this
case. Three types of example:
1. A !MatchSpy which is checking for the existence of a particular flag
in any of the matching documents. (eg, checking to see if any of the
matching documents are products which are "special offers"). Once any
such document has been seen, the !MatchSpy could be discarded.
1. A !MatchSpy which is checking for facets, but for which only
approximate results and no counts are required, could terminate after
seeing a specific maximum number of results (say, 1000). We could just
set a checkatleast value of 1000 for the search, except that there may be
other aspects of the search causing a higher checkatleast value to be
required (for example, other facets for which a more exact count is
required, or a !MatchSpy like that in the example above checking for
special offers. Such a !MatchSpy needs to see all potential matches, but
may terminate at any time.).
1. A !MatchSpy may only be interested in the documents with the top N
weights. For example, when using a diversity reweighting algorithm, it's
useful to know the categories of the top N weighted documents in each of
several categories (this is similar to a collapse, but we want to ensure
that if there are sufficient matching documents, each of the categories
has a full complement of N matches found). Such a !MatchSpy could report
the lowest weight of document that it's currently interested in to the
matcher, potentially allowing the min_weight parameter passed to the
posting list tree to be increased, rather than us simply passing
checkatleast=doccount as is currently needed to get this guarantee.
Currently, such matchspies can simply stop doing any work after they've
seen enough information, but it would be more efficient if they could tell
the matcher that they are finished, so that the matcher could terminate
early.
Several possible implementations spring to mind:
a. Change `MatchSpy::operator()` to return a bool; interpret a "true"
value as "I'm finished". This would allow case 1 above to be handled
nicely. Case 2 could also be handled in this way, but would require an
appropriate checkatleast value to be set too.
a. Change `MatchSpy::operator()` to return a double, indicating the
minimum weight that the !MatchSpy is now interested in. We could
interpret a value of `DBL_MAX` as "I'm finished". This would support case
1 and 2 as before, but also support case 3. If we combine this with the
min_weight value used by the existing matcher code, we could also remove
the need to set a high checkatleast value for matchspies which need to see
lots of documents - the matchspy would simply return 0.0 until it has seen
enough documents, at which point it could return a high value instead.
a. As for implementation (b), but in addition allow !MatchSpies to remove
themselves from the matcher by a less hacky means than returning
`DBL_MAX`. One approach could be to use a similar approach to
!PostingSources (of registering the source with the matcher using a method
on the base class); add a method on the base !MatchSpy class to tell the
matcher that the matchspy can be removed. Actually, this approach could
be used instead of a return value from !MatchSpy::operator() for setting a
new minimum weight, too, but it seems likely to be less lightweight than
using the return value.
Marking for 1.1.7, since implementing these might require API changes.
--
--
Ticket URL: <http://trac.xapian.org/ticket/433#comment:7>
Xapian <http://xapian.org/>
Xapian
More information about the Xapian-tickets
mailing list