[Xapian-tickets] [Xapian] #433: MatchSpy should allow early termination
Xapian
nobody at xapian.org
Mon Feb 1 17:36:31 GMT 2010
#433: MatchSpy should allow early termination
-------------------------+--------------------------------------------------
Reporter: richard | Owner: olly
Type: enhancement | Status: new
Priority: normal | Milestone: 1.1.7
Component: Matcher | Version: SVN trunk
Severity: normal | Keywords:
Blockedby: | Platform: All
Blocking: |
-------------------------+--------------------------------------------------
I've recently come across some situations where it would be nice to have a
!MatchSpy at the start of a search, but after the !MatchSpy has seen a
certain set of values, it's easy to tell that the !MatchSpy won't benefit
significantly from seeing any further values. This is particularly
attractive in a database in which searches are being biased by a document-
specific weight, such that the documents with a high weight are first,
since the inaccuracy of early termination is often less important in this
case. Three types of example:
1. A !MatchSpy which is checking for the existence of a particular flag
in any of the matching documents. (eg, checking to see if any of the
matching documents are products which are "special offers"). Once any
such document has been seen, the !MatchSpy could be discarded.
1. A !MatchSpy which is checking for facets, but for which only
approximate results and no counts are required, could terminate after
seeing a specific maximum number of results (say, 1000). We could just
set a checkatleast value of 1000 for the search, except that there may be
other aspects of the search causing a higher checkatleast value to be
required (for example, other facets for which a more exact count is
required, or a !MatchSpy like that in the example above checking for
special offers. Such a !MatchSpy needs to see all potential matches, but
may terminate at any time.).
1. A !MatchSpy may only be interested in the documents with the top N
weights. For example, when using a diversity reweighting algorithm, it's
useful to know the categories of the top N weighted documents in each of
several categories (this is similar to a collapse, but we want to ensure
that if there are sufficient matching documents, each of the categories
has a full complement of N matches found). Such a !MatchSpy could report
the lowest weight of document that it's currently interested in to the
matcher, potentially allowing the min_weight parameter passed to the
posting list tree to be increased, rather than us simply passing
checkatleast=doccount as is currently needed to get this guarantee.
Currently, such matchspies can simply stop doing any work after they've
seen enough information, but it would be more efficient if they could tell
the matcher that they are finished, so that the matcher could terminate
early.
Several possible implementations spring to mind:
a. Change !MatchSpy::operator() to return a bool; interpret a "true"
value as "I'm finished". This would allow case 1 above to be handled
nicely. Case 2 could also be handled in this way, but would require an
appropriate checkatleast value to be set too.
a. Change !MatchSpy::operator() to return a float, indicating the minimum
weight that the !MatchSpy is now interested in. We could interpret a
value of DBL_MAX as "I'm finished". This would support case 1 and 2 as
before, but also support case 3. If we combine this with the min_weight
value used by the existing matcher code, we could also remove the need to
set a high checkatleast value for matchspies which need to see lots of
documents - the matchspy would simply return 0.0 until it has seen enough
documents, at which point it could return a high value instead.
a. As for implementation (b), but in addition allow !MatchSpies to remove
themselves from the matcher by a less hacky means than returning DBL_MAX.
One approach could be to use a similar approach to !PostingSources (of
registering the source with the matcher using a method on the base class);
add a method on the base !MatchSpy class to tell the matcher that the
matchspy can be removed. Actually, this approach could be used instead of
a return value from !MatchSpy::operator() for setting a new minumum
weight, too, but it seems likely to be less lightweight than using the
return value.
Marking for 1.1.7, since implementing these might require API changes.
--
Ticket URL: <http://trac.xapian.org/ticket/433>
Xapian <http://xapian.org/>
Xapian
More information about the Xapian-tickets
mailing list