[Xapian-tickets] [Xapian] #433: MatchSpy should allow early termination

Xapian nobody at xapian.org
Thu Apr 16 23:24:13 BST 2015


#433: MatchSpy should allow early termination
-------------------------+------------------------------
 Reporter:  richard      |             Owner:  olly
     Type:  enhancement  |            Status:  new
 Priority:  normal       |         Milestone:  1.3.x
Component:  Matcher      |           Version:  SVN trunk
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  All
-------------------------+------------------------------
Description changed by olly:

Old description:

> I've recently come across some situations where it would be nice to have
> a !MatchSpy at the start of a search, but after the !MatchSpy has seen a
> certain set of values, it's easy to tell that the !MatchSpy won't benefit
> significantly from seeing any further values.  This is particularly
> attractive in a database in which searches are being biased by a
> document-specific weight, such that the documents with a high weight are
> first, since the inaccuracy of early termination is often less important
> in this case.  Three types of example:
>
>  1. A !MatchSpy which is checking for the existence of a particular flag
> in any of the matching documents.  (eg, checking to see if any of the
> matching documents are products which are "special offers").  Once any
> such document has been seen, the !MatchSpy could be discarded.
>  1. A !MatchSpy which is checking for facets, but for which only
> approximate results and no counts are required, could terminate after
> seeing a specific maximum number of results (say, 1000).  We could just
> set a checkatleast value of 1000 for the search, except that there may be
> other aspects of the search causing a higher checkatleast value to be
> required (for example, other facets for which a more exact count is
> required, or a !MatchSpy like that in the example above checking for
> special offers.  Such a !MatchSpy needs to see all potential matches, but
> may terminate at any time.).
>  1. A !MatchSpy may only be interested in the documents with the top N
> weights.  For example, when using a diversity reweighting algorithm, it's
> useful to know the categories of the top N weighted documents in each of
> several categories (this is similar to a collapse, but we want to ensure
> that if there are sufficient matching documents, each of the categories
> has a full complement of N matches found).  Such a !MatchSpy could report
> the lowest weight of document that it's currently interested in to the
> matcher, potentially allowing the min_weight parameter passed to the
> posting list tree to be increased, rather than us simply passing
> checkatleast=doccount as is currently needed to get this guarantee.
>
> Currently, such matchspies can simply stop doing any work after they've
> seen enough information, but it would be more efficient if they could
> tell the matcher that they are finished, so that the matcher could
> terminate early.
>
> Several possible implementations spring to mind:
>
>  a. Change !MatchSpy::operator() to return a bool; interpret a "true"
> value as "I'm finished".  This would allow case 1 above to be handled
> nicely.  Case 2 could also be handled in this way, but would require an
> appropriate checkatleast value to be set too.
>  a. Change !MatchSpy::operator() to return a float, indicating the
> minimum weight that the !MatchSpy is now interested in.  We could
> interpret a value of DBL_MAX as "I'm finished".  This would support case
> 1 and 2 as before, but also support case 3.  If we combine this with the
> min_weight value used by the existing matcher code, we could also remove
> the need to set a high checkatleast value for matchspies which need to
> see lots of documents - the matchspy would simply return 0.0 until it has
> seen enough documents, at which point it could return a high value
> instead.
>  a. As for implementation (b), but in addition allow !MatchSpies to
> remove themselves from the matcher by a less hacky means than returning
> DBL_MAX.  One approach could be to use a similar approach to
> !PostingSources (of registering the source with the matcher using a
> method on the base class); add a method on the base !MatchSpy class to
> tell the matcher that the matchspy can be removed.  Actually, this
> approach could be used instead of a return value from
> !MatchSpy::operator() for setting a new minumum weight, too, but it seems
> likely to be less lightweight than using the return value.
>
> Marking for 1.1.7, since implementing these might require API changes.

New description:

 I've recently come across some situations where it would be nice to have a
 !MatchSpy at the start of a search, but after the !MatchSpy has seen a
 certain set of values, it's easy to tell that the !MatchSpy won't benefit
 significantly from seeing any further values.  This is particularly
 attractive in a database in which searches are being biased by a document-
 specific weight, such that the documents with a high weight are first,
 since the inaccuracy of early termination is often less important in this
 case.  Three types of example:

  1. A !MatchSpy which is checking for the existence of a particular flag
 in any of the matching documents.  (eg, checking to see if any of the
 matching documents are products which are "special offers").  Once any
 such document has been seen, the !MatchSpy could be discarded.
  1. A !MatchSpy which is checking for facets, but for which only
 approximate results and no counts are required, could terminate after
 seeing a specific maximum number of results (say, 1000).  We could just
 set a checkatleast value of 1000 for the search, except that there may be
 other aspects of the search causing a higher checkatleast value to be
 required (for example, other facets for which a more exact count is
 required, or a !MatchSpy like that in the example above checking for
 special offers.  Such a !MatchSpy needs to see all potential matches, but
 may terminate at any time.).
  1. A !MatchSpy may only be interested in the documents with the top N
 weights.  For example, when using a diversity reweighting algorithm, it's
 useful to know the categories of the top N weighted documents in each of
 several categories (this is similar to a collapse, but we want to ensure
 that if there are sufficient matching documents, each of the categories
 has a full complement of N matches found).  Such a !MatchSpy could report
 the lowest weight of document that it's currently interested in to the
 matcher, potentially allowing the min_weight parameter passed to the
 posting list tree to be increased, rather than us simply passing
 checkatleast=doccount as is currently needed to get this guarantee.

 Currently, such matchspies can simply stop doing any work after they've
 seen enough information, but it would be more efficient if they could tell
 the matcher that they are finished, so that the matcher could terminate
 early.

 Several possible implementations spring to mind:

  a. Change `MatchSpy::operator()` to return a bool; interpret a "true"
 value as "I'm finished".  This would allow case 1 above to be handled
 nicely.  Case 2 could also be handled in this way, but would require an
 appropriate checkatleast value to be set too.
  a. Change `MatchSpy::operator()` to return a double, indicating the
 minimum weight that the !MatchSpy is now interested in.  We could
 interpret a value of `DBL_MAX` as "I'm finished".  This would support case
 1 and 2 as before, but also support case 3.  If we combine this with the
 min_weight value used by the existing matcher code, we could also remove
 the need to set a high checkatleast value for matchspies which need to see
 lots of documents - the matchspy would simply return 0.0 until it has seen
 enough documents, at which point it could return a high value instead.
  a. As for implementation (b), but in addition allow !MatchSpies to remove
 themselves from the matcher by a less hacky means than returning
 `DBL_MAX`.  One approach could be to use a similar approach to
 !PostingSources (of registering the source with the matcher using a method
 on the base class); add a method on the base !MatchSpy class to tell the
 matcher that the matchspy can be removed.  Actually, this approach could
 be used instead of a return value from !MatchSpy::operator() for setting a
 new minimum weight, too, but it seems likely to be less lightweight than
 using the return value.

 Marking for 1.1.7, since implementing these might require API changes.

--

--
Ticket URL: <http://trac.xapian.org/ticket/433#comment:7>
Xapian <http://xapian.org/>
Xapian



More information about the Xapian-tickets mailing list