[Xapian-discuss] Filter similar results

Tue Sep 12 23:06:50 BST 2006

On Tue, Sep 12, 2006 at 02:35:42PM -0500, Robby Walker wrote:
> I'm trying to implement something like Google's "similar results
> omitted" and I'm not sure how to go about it.

> One option is to do multiple smaller searches - one per set - this
> seems inefficient?  Is the cost of search weighted towards setting up
> the search or is it more affected by the number of desired results?

For a large system, I/O will dominate.  Once the blocks involved are in
the cache, the search is much faster - try repeating a search which was
slow and you'll generally find it's much quicker.

So running two searches for similar things will typically be quicker
than twice the time to do one, and the worst case performance is
definitely much better.  Plus if this is a web search, you only incur
the browser/server roundtrip overhead once so the perceived time to
the end user is certainly less than doubled.

> What happens when I'm searching 20 or more different sets?

It's hard to give a general answer - it will depends on many things, not
least what an acceptable search time is for your users.

> Another option would be to use MatchDecider and only "OK" the first
> few documents from each set.  Is it true that documents will be tested
> by MatchDecider in any sort of order - e.g. highest relevancy first?

The order is essentially unspecified.  It's probably currently ascending
docid order, but even that's not something you should rely on - I plan
to implement a new optimisation to delay applying the MatchDecider so
we end up testing fewer documents, and that will change the order.

> A third option is to do a standard relevancy search over all sets.  If
> one set dominates this result, I can search again filtering out that
> set and merge the results.  Repeat if necessary.  Again the question
> here is what part of search is most expensive.
> 
> Finally, the fourth option is to just return way more documents than I
> need and then go filter it manually.

The main problem for the last option is that many of the matcher's
optimisations are less effective when you ask for a larger MSet.
But perhaps cheaper than doing N searches.

> Any of these seem good?  Or am I missing an easier way to do this?

My thought would be to enhance the collapse feature to allow collapsing
to leave N documents with an identical collapse key instead of just one.
It's probably a comparable amount of work to trying to do it outside of
Xapian and you get a much more satisfactory solution.

Cheers,
    Olly