[Xapian-discuss] Matchspy and faceting

Richard Boulton richard at tartarus.org
Sun Aug 29 11:21:50 BST 2010


> 1.I thought the method on the MVCMS was get_top_values but the class seems
> to have the methods top_values_begin and top_values_end instead, which do
> seem to work though I'm unsure what the arguments are.  Can you confirm this
> is right, and since I only found this by reading the source code, is there
> any matchspy documentation?

In C++, the methods are top_values_begin() and top_values_end() (or
values_begin() and values_end() if you want all values).  These
methods return C++ iterators, which are a bit awkward to use from PHP,
but should work fine.  top_values_begin() and top_values_end() each
take one parameter, which is the number of values to return.

See xapian-bindings/php/smoketest.php for an example of how to use
values_begin() - top_values_begin() is just the same, except for the
extra argument.  Basically, you get an iterator from values_begin()
and use its "equals" method to compare it to values_end(), to check
whether it has reached the end.  If it hasn't, you can get the
current value pointed to with $iterator->get_term(), and the count
for that value with $iterator->get_termfreq().  Then, you can move
the iterator to the next position with $iterator->next().
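
As a rough illustration of that loop shape only, here is a sketch in
Python.  FakeValueIterator and FakeSpy are hypothetical stand-ins
written for this example and are not part of Xapian; only the method
names (values_begin, values_end, equals, get_term, get_termfreq,
next) are taken from the description above.

```python
class FakeValueIterator:
    """Hypothetical stand-in for the iterator the bindings hand back."""

    def __init__(self, items, pos):
        self._items = items  # list of (value, count) pairs
        self._pos = pos

    def equals(self, other):
        return self._pos == other._pos

    def get_term(self):
        return self._items[self._pos][0]

    def get_termfreq(self):
        return self._items[self._pos][1]

    def next(self):
        self._pos += 1


class FakeSpy:
    """Hypothetical stand-in for a matchspy holding value counts."""

    def __init__(self, counts):
        self._items = sorted(counts.items())

    def values_begin(self):
        return FakeValueIterator(self._items, 0)

    def values_end(self):
        return FakeValueIterator(self._items, len(self._items))


def collect_values(spy):
    """Walk from values_begin() to values_end(), as described above."""
    results = []
    it = spy.values_begin()
    end = spy.values_end()
    while not it.equals(end):  # stop once the iterator reaches the end
        results.append((it.get_term(), it.get_termfreq()))
        it.next()  # advance to the next value
    return results
```

The same loop should carry over to top_values_begin() and
top_values_end(), with the number of values passed to each.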

There's no documentation of this specifically for PHP, but the general
C++ documentation for this stuff is in
xapian-core/docs/categorisation.rst

> 2. It seems like you have to fetch an mset after attaching the matchspy
> (even if you don't need one) before the matchspy will return results. Is
> that right?

Yes, that's correct.  The matchspy basically "spies" on the operation
of the matcher as it's computing results.  If you don't need the
results in the mset, you can set the maxitems parameter of get_mset()
to 0, which will stop Xapian doing the work of sorting the top
results.  However, you'll need to set the "checkatleast" parameter to
a reasonably high value, or the matcher won't bother actually looking
at any documents.  There's some discussion of this in
xapian-core/docs/categorisation.rst

> 3. Can you attach multiple matchspys to the same enquire?  I've tried to do
> this but it seems to segfault somewhere on or after the third, so currently
> I'm having to set up a new enquire for each matchspy (one per taxonomy)
> which must be running the query again I'd have thought.

This should work fine - I've certainly performed searches with many
matchspys attached from Python and from C++.  If you could post some
sample code which fails in this way, we might be able to spot
something wrong in the way in which you're calling the matchspy, but
it's possible there's a bug in the PHP bindings for Xapian here too.

> 4. What's the best strategy for achieving ideal facet suggestions?  In some
> cases, where two or more tags from the same taxonomy are regularly used on
> the same post, eg people or companies, it may make sense to suggest tags
> from a matchspy on the query that is running, so if your query includes
> 'barack obama' for example, you get facet suggestions in the people taxonomy
> that frequently co-occur with Obama.  It would seem to make sense to AND
> these with Obama if they are chosen.  However, there are other taxonomies in
> which tags are all mutually exclusive, eg 'access level', where a post can
> only have one access level.  The logic above would therefore not produce any
> suggestions beyond the one you've already queried on.  So it might be better
> to produce suggestions for each taxonomy based on a query that excludes any
> selected tags in that taxonomy but applies filters on tags selected in other
> taxonomies.

I think that's usually the best strategy.  Alternatively, I've seen
many sites where a category can only have a single value chosen in it,
which avoids this problem altogether, and is sometimes not a
significant limitation.  Usually "OR"ing together multiple values
chosen in a single category is the natural thing to do - though as you
indicate in your examples, some facets might want special behaviour
such as ANDing values together.  I think you need to manually choose
what behaviour is suitable for each category, based on the type of
data found in the category.
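
To make the per-category choice concrete, here is a minimal sketch in
plain Python of how the query behind each taxonomy's suggestions
might be assembled.  It doesn't use Xapian at all: the nested tuples,
the "taxonomy:tag" term format, the "AND"/"OR" mode names and all the
function names are assumptions made up for this illustration.

```python
def combine(op, parts):
    """Join sub-queries with op, collapsing empty and single cases."""
    parts = [p for p in parts if p is not None]
    if not parts:
        return None
    if len(parts) == 1:
        return parts[0]
    return (op,) + tuple(parts)


def taxonomy_filter(taxonomy, tags, mode):
    """Filter for the tags selected in one taxonomy (hypothetical terms)."""
    terms = ["%s:%s" % (taxonomy, tag) for tag in tags]
    return combine(mode, terms)


def suggestion_query(base, selected, modes, taxonomy):
    """Query whose matches feed the facet counts for `taxonomy`.

    Filters from other taxonomies always apply.  The taxonomy's own
    selections are dropped unless it was manually marked "AND", in
    which case they are kept so suggestions co-occur with them.
    """
    filters = [
        taxonomy_filter(name, tags, modes.get(name, "OR"))
        for name, tags in sorted(selected.items())
        if name != taxonomy or modes.get(name, "OR") == "AND"
    ]
    return combine("AND", [base] + filters)
```

With this approach each taxonomy can end up with a different query,
and each distinct query needs its own match run; taxonomies whose
queries come out identical could share one run with several matchspys
attached.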

>  I'm interested in whether anyone's looked at this area of
> faceting in detail and had this problem before.

I'd be interested in tales from users here, too!

-- 
Richard


