[Xapian-tickets] [Xapian] #433: MatchSpy should allow early termination

Xapian nobody at xapian.org
Mon Feb 1 17:36:31 GMT 2010


#433: MatchSpy should allow early termination
-------------------------+--------------------------------------------------
 Reporter:  richard      |       Owner:  olly     
     Type:  enhancement  |      Status:  new      
 Priority:  normal       |   Milestone:  1.1.7    
Component:  Matcher      |     Version:  SVN trunk
 Severity:  normal       |    Keywords:           
Blockedby:               |    Platform:  All      
 Blocking:               |  
-------------------------+--------------------------------------------------
 I've recently come across some situations where it would be nice to have a
 !MatchSpy at the start of a search, but after the !MatchSpy has seen a
 certain set of values, it's easy to tell that the !MatchSpy won't benefit
 significantly from seeing any further values.  This is particularly
 attractive in a database in which searches are being biased by a document-
 specific weight, such that the documents with a high weight are first,
 since the inaccuracy of early termination is often less important in this
 case.  Three types of example:

  1. A !MatchSpy which is checking for the existence of a particular flag
 in any of the matching documents.  (eg, checking to see if any of the
 matching documents are products which are "special offers").  Once any
 such document has been seen, the !MatchSpy could be discarded.
  1. A !MatchSpy which is checking for facets, but for which only
 approximate results and no counts are required, could terminate after
 seeing a specific maximum number of results (say, 1000).  We could just
 set a checkatleast value of 1000 for the search, except that there may be
 other aspects of the search causing a higher checkatleast value to be
 required (for example, other facets for which a more exact count is
 required, or a !MatchSpy like that in the example above checking for
 special offers.  Such a !MatchSpy needs to see all potential matches, but
 may terminate at any time.).
  1. A !MatchSpy may only be interested in the documents with the top N
 weights.  For example, when using a diversity reweighting algorithm, it's
 useful to know the categories of the top N weighted documents in each of
 several categories (this is similar to a collapse, but we want to ensure
 that if there are sufficient matching documents, each of the categories
 has a full complement of N matches found).  Such a !MatchSpy could report
 the lowest weight of document that it's currently interested in to the
 matcher, potentially allowing the min_weight parameter passed to the
 posting list tree to be increased, rather than us simply passing
 checkatleast=doccount as is currently needed to get this guarantee.

 Currently, such matchspies can simply stop doing any work after they've
 seen enough information, but it would be more efficient if they could tell
 the matcher that they are finished, so that the matcher could terminate
 early.

 Several possible implementations spring to mind:

  a. Change !MatchSpy::operator() to return a bool; interpret a "true"
 value as "I'm finished".  This would allow case 1 above to be handled
 nicely.  Case 2 could also be handled in this way, but would require an
 appropriate checkatleast value to be set too.
  a. Change !MatchSpy::operator() to return a float, indicating the minimum
 weight that the !MatchSpy is now interested in.  We could interpret a
 value of DBL_MAX as "I'm finished".  This would support case 1 and 2 as
 before, but also support case 3.  If we combine this with the min_weight
 value used by the existing matcher code, we could also remove the need to
 set a high checkatleast value for matchspies which need to see lots of
 documents - the matchspy would simply return 0.0 until it has seen enough
 documents, at which point it could return a high value instead.
  a. As for implementation (b), but in addition allow !MatchSpies to remove
 themselves from the matcher by a less hacky means than returning DBL_MAX.
 One approach could be to use a similar approach to !PostingSources (of
 registering the source with the matcher using a method on the base class);
 add a method on the base !MatchSpy class to tell the matcher that the
 matchspy can be removed.  Actually, this approach could be used instead of
 a return value from !MatchSpy::operator() for setting a new minumum
 weight, too, but it seems likely to be less lightweight than using the
 return value.

 Marking for 1.1.7, since implementing these might require API changes.

-- 
Ticket URL: <http://trac.xapian.org/ticket/433>
Xapian <http://xapian.org/>
Xapian



More information about the Xapian-tickets mailing list