[Xapian-discuss] Unique Term Listings

Olly Betts olly at survex.com
Thu Nov 16 05:39:37 GMT 2006


On Wed, Nov 15, 2006 at 05:03:22PM +0000, James Aylett wrote:
> Can we not do this using collapse keys?

You could run the query twice, once using collapsing and once to get the
uncollapsed "best N" results to show to the user.

The problem is that the "collapse count" which the MSet can return is
only a lower bound.  It only counts the documents with the same collapse
value whose collapse value the matcher actually looked at.  But the
matcher may have discarded some documents by weight alone which would
otherwise have been collapsed.
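
As a rough illustration of why the count is only a lower bound, here is
a toy model in Python.  It is not Xapian's matcher: the function name,
the (weight, category) tuples and the discard rule are all assumptions
made up for this sketch.  The only thing it mimics is that once the MSet
is full, a document which can't beat the lowest kept weight is dropped
without its collapse value ever being examined.

```python
def toy_collapse_counts(docs, mset_size):
    """Toy model (NOT Xapian's real matcher) of a collapsing match.

    docs is an iterable of (weight, category) pairs; the result maps each
    kept category to the number of *other* documents collapsed into it.
    """
    kept = {}  # category -> [best weight seen, collapse count]
    for weight, category in docs:
        if len(kept) >= mset_size:
            min_weight = min(w for w, _ in kept.values())
            if weight <= min_weight:
                # Discarded on weight alone: the category is never
                # looked at, so this document is never counted.
                continue
        if category in kept:
            entry = kept[category]
            entry[0] = max(entry[0], weight)
            entry[1] += 1
        else:
            kept[category] = [weight, 0]
            if len(kept) > mset_size:
                # Too many entries: drop the lowest-weighted one.
                worst = min(kept, key=lambda c: kept[c][0])
                del kept[worst]
    return {c: count for c, (_, count) in kept.items()}


# Hypothetical data: category "a" has three documents, "b" has two.
docs = [(0.9, "a"), (0.8, "b"), (0.5, "a"), (0.4, "b"), (0.3, "a")]

# With a one-entry MSet, two further "a" documents exist but are
# discarded by weight before being counted.
print(toy_collapse_counts(docs, 1))  # {'a': 0}
```

The reported count of 0 is a valid lower bound on the true value of 2,
but not much use as a category count.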

But I think you could make this work if you set the requested MSet size
to more than the number of different categories for the "collapsing
match" - then the collapse counts should be exact (because you'd get one
MSet entry per used category when collapsing).  Note that the
"collapse_count" doesn't include the document which is left in the MSet,
so add one to get the actual category counts!
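
A minimal sketch of the "collapsing match" using the Python bindings
might look like the following.  The function names and the
max_categories parameter are hypothetical, and it assumes every matching
document has a non-empty category value in the given value slot
(documents with an empty value would not be collapsed, so they could
fill up the MSet and make the counts inexact again).

```python
def category_counts(mset):
    """Turn a collapsed MSet into a {category: document count} dict.

    Works on anything that yields items with collapse_key and
    collapse_count attributes, as items of a Xapian MSet do in the
    Python bindings.
    """
    # collapse_count doesn't include the document left in the MSet,
    # so add one to get the actual count for the category.
    return {item.collapse_key: item.collapse_count + 1 for item in mset}


def collapsed_category_counts(database, query, slot, max_categories):
    """Run the collapsing match (hypothetical helper).

    max_categories is an assumed upper bound on the number of distinct
    categories; the caller has to supply it.
    """
    import xapian  # imported here so category_counts() works without it

    enquire = xapian.Enquire(database)
    enquire.set_query(query)
    enquire.set_collapse_key(slot)
    # Ask for more entries than there are categories, so that no
    # document gets discarded by weight before being counted.
    mset = enquire.get_mset(0, max_categories + 1)
    return category_counts(mset)
```

The "best N" results to show to the user would then come from a second
get_mset() call on an Enquire object without a collapse key set.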

This approach is probably comparable in speed to my "spy" approach, but
is limited to a single category per document whereas the spy is more
versatile.

On the plus side, this can be implemented from any scripting language
via the bindings (only some languages currently support subclassing
Xapian::MatchDecider).  Also the overhead of a callback from C++ to the
scripting language for every document in a large database would almost
certainly mean the collapsing approach is faster than the spy there,
even where subclassing is supported.

Cheers,
    Olly


