[Xapian-discuss] Filter similar results

Olly Betts olly at survex.com
Mon Sep 18 08:05:00 BST 2006


On Sat, Sep 16, 2006 at 10:57:58PM -0500, Robby Walker wrote:
> >My thought would be to enhance the collapse feature to allow collapsing
> >to leave N documents with an identical collapse key instead of just one.
>
> I've been thinking this over for a couple of days now, and I think in
> fact going one step more generic might be even better.  Add the
> concept of an MSetDecider that takes an MSetItem and an MSet and
> decides what item to remove (if any).

It's certainly an interesting idea.  I wonder if it might be too generic
though.  There's some fairly subtle stuff going on during the match
which could be rather hard to abstract out cleanly.

It would also be unhelpful if it prevented possible future optimisations
(on the flip side, it might make some easier).

> The existing code would work as an MSetDecider based on collapse_key,
> Weight class, and sort_key.  You could easily write an MSetDecider
> that also uses collapse_count.

Beware that collapse count is just how many duplicates we've seen with
the same value - all you can really rely on is that this is less than or
equal to the total number in the database with that value.

> For my case, I could write a completely different MSetDecider that
> decides partly based on the number of documents found so far in each
> set.

The problem with this would be that asking for more matches can easily
change the number of documents seen so far for each set when particular
documents are considered.  So the results with this MSetDecider would be
rather brittle - e.g. you couldn't ask for hits 1-10 and then hits 11-20
and necessarily expect each hit to be the same as the result of asking
for hits 1-20.  So you couldn't implement paging through hits except by
limiting how far the user can page and always fetching that number.

> For a simpler example (still semi-applicable to my problem) let's
> assume that a user wants to sort based on similarity.  So, maybe we're
> searching an employee database and we say something like
> Lastname:Walter AND UsesGmail AND LikesXapian.  And maybe we want to
> sort partially based on similarity to last name.  So I (Walker) would
> show up early based on my last name's similarity to Walter.  I don't
> believe a ranking like this can be done currently.  Is that true?

If "Lastname:" maps to a term prefix, then you either have to conflate
similar names at index time (e.g. use metaphone to generate "sounds
like" terms, much like how stemming is done) or at query parsing time
(e.g. generate a list of similar sounding names and OR them together,
much like how wildcard expansion is done).  By the time you get to the
matcher you just have the posting lists for the terms in the query, so
your change wouldn't make any difference to what was possible.

If "Lastname:" maps to a document value, you could do it using a
MatchBiasFunctor - this doesn't quite exist yet, but someone posted a
patch which is a lot of what's required.  With your change, you'd
implement something very like a MatchBiasFunctor but as an MSetDecider.

> So, it would add a few nice ways to customize Xapian.  On the down
> side, it puts one or more virtual function calls in the middle of an
> inner loop

I'm not sure this is inherently a problem.  Almost all the posting list 
methods are virtual, and virtual method calls aren't that slow.  But if
it is measurably slower overall (for whatever reason) that would be a
bad thing of course.

One issue to bear in mind is that this is unlikely to be feasible to
use from bindings in other languages with a large database (calling
a C++ virtual method which is actually implemented in Python, PHP, Perl,
etc is almost certainly going to be too much overhead to do for every
single document considered).

> and probably complicates the Remote communication. (for the
> time being, you could only support the default MSetDecider over remote
> conns)

The remote server just sends an MSet back, and this gets fed into the
matcher.  We tried streaming posting lists across the network and it's
just too slow, even with fast ethernet.  Even streaming a posting list
filtered through an MSet on the server was slower than waiting and
sending the final MSet.

I suspect those tests may predate 1Gb ethernet, but I don't think that
would make enough difference, and we want the remote backend to be
usable over slower networks too.

You might just be able to ensure that the MSetDecider is also installed
on the server (like you have to with custom Weight subclasses) and then
return the final MSet.  I suspect that's not going to work for every
possible MSetDecider though.

> I'm willing to write the patch provided you tell me it's 1) not stupid
> and 2) maybe even useful.  Thoughts?

I'm certainly not without reservations, but I'm not convinced it's
impossible to achieve.  I wouldn't dismiss the idea out of hand, but
I also can't promise we'd accept such a patch before it's written.

Cheers,
    Olly



More information about the Xapian-discuss mailing list