[Xapian-discuss] Unique Term Listings

Olly Betts olly at survex.com
Wed Nov 15 19:23:37 GMT 2006


On Tue, Nov 14, 2006 at 11:19:35AM +0000, Martin Hearn wrote:
> I'm trying to achieve something similar to what ebay does.
> E.g. If I search for "harry potter" on ebay I get a list of over 50  
> categories with the number of entries for each category.

An issue here is that you presumably want a full list of the categories
with exact counts in each.  If you want a full list of categories, then
query expansion isn't going to work anyway - it's good for finding the
best N categories.

Xapian is built with the idea that we want to do as little work as
possible while still returning the best N results.  So it will try
not to look consider every document.  For example, the query "A OR B"
can often be ended when we reach the last document indexed by one
of the terms, if we can see that the maximum weight we could get
from the other term is lower than the lowest weight in our candidate
set.

But to get a full list of categories and counts, you need to consider
all matching documents.  So while you can achieve this, you shouldn't
be suprised if it's slower than just searching for 10 matches.

The first thing you'll need to do is put the category (or category list
if there are several) into a document value at index time.

Then the trick is to use a MatchDecider to spy on the matcher (rather
than to decide which matches are wanted - it always says "yes" for
that).  Our "spy" will simply check the value with the category in
and update a std::map with counts (if categories are numeric, this
could just be an array).  If there are multiple categories, then
just list them in the value (e.g. tab separated) and split and tally
each in the spy.

It's easier to code this than explain in words, so here's a patch
against the "quest" example to show what I mean:

http://www.oligarchy.co.uk/xapian/patches/match-spy-for-categories.patch

When searching, we set check_at_least to some high value when calling
get_mset so that the matcher will consider at least that many matching
documents before giving up early.  You could use (Xapian::doccount)-1
here, but I'd suggest imposing a sane limit here since you probably
don't really want to build an exact set of categories for more a million
matches.  So I impose a limit of a million and say "274+" instead of
"274" next to category counts if we hit this limit.

While testing this, I've noticed that check_at_least seems to actually
return all the matches (which it shouldn't) so this technique may be
faster than the patch currently suggests.  I'll take a look and fix the
check_at_least handling if it is misbehaving.

Cheers,
    Olly



More information about the Xapian-discuss mailing list