[Xapian-discuss] Filter or MatchDecider or Other?

Olly Betts olly at survex.com
Fri Aug 20 01:44:20 BST 2004


On Thu, Aug 19, 2004 at 06:24:36PM -0400, Mike Boone wrote:
> My existing search tool allows me to limit the search results based on other
> parameters besides the text content. For example, each document has an
> associated category, and I can limit my search to only a certain category.
> Similarly, I can limit the search by a geographic region, or by a date
> range.
> 
> I've built a Xapian-database of just the text content, and the simplesearch
> tool seems to return relevant results as expected. What's the best way to
> add the limiting to the search?
> 
> It looks like I could maybe add a term of something like
> 'Location:Location1' to the end of my document and then add OP_FILTER to the
> query and require that piece of data.

That's the best way to filter on pre-defined categories.  Pick a syntax
for the term which won't clash with terms generated from the text
(including a colon like you suggest is fine; the convention Omega uses
is that terms from text are lower case, so capital letter prefixes are
used for filter terms).
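
As a rough illustration of that naming convention (plain Python, not the
Xapian API; the "XLOC" prefix and the helper name are made up for this
sketch), text terms are lower-cased while the filter term carries a
capitalised prefix, so the word "location" in the text can never collide
with the filter term:

```python
import re

# Hypothetical prefix for the location filter; any capitalised prefix that
# cannot be produced from lower-cased text would serve the same purpose.
FILTER_PREFIX = "XLOC"


def document_terms(text, location):
    """Return the set of index terms for one document.

    Words from the text are lower-cased; the location becomes a single
    prefixed filter term that text words cannot clash with.
    """
    terms = {word.lower() for word in re.findall(r"[A-Za-z0-9]+", text)}
    terms.add(FILTER_PREFIX + location.lower())
    return terms


if __name__ == "__main__":
    print(sorted(document_terms("Location one Hotel", "location1")))
```

The resulting set is what you would add to the document at index time, with
the filter term added as a boolean term alongside the text terms.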

> Or perhaps I could set the category as a document value and then
> implement the MatchDecider to limit on that?

That's likely to be less efficient than using OP_FILTER.  A document
value is designed to be fast to access, but using a term means the
list of document ids is effectively precalculated.
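
To make the difference concrete, here is a toy model in plain Python (not
Xapian code; the index contents, term names and function names are all
invented for the example).  Filtering by term is an intersection of two
lists that already exist, whereas filtering by value has to look at each
candidate document in turn:

```python
# Toy inverted index: term -> set of document ids containing that term.
# The "XLOC..." entries stand in for filter terms added at index time.
POSTINGS = {
    "xapian": {1, 2, 4},
    "search": {1, 3, 4},
    "XLOClocation1": {1, 3},
    "XLOClocation2": {2, 4},
}

# Toy per-document values: document id -> stored location value.
VALUES = {1: "location1", 2: "location2", 3: "location1", 4: "location2"}


def filter_by_term(text_term, filter_term):
    """Intersect two precalculated posting lists (the OP_FILTER idea)."""
    matches = POSTINGS.get(text_term, set()) & POSTINGS.get(filter_term, set())
    return sorted(matches)


def filter_by_value(text_term, wanted):
    """Check the stored value of every candidate (the MatchDecider idea)."""
    candidates = POSTINGS.get(text_term, set())
    return sorted(doc for doc in candidates if VALUES[doc] == wanted)


if __name__ == "__main__":
    print(filter_by_term("xapian", "XLOClocation1"))
    print(filter_by_value("xapian", "location1"))
```

Both functions return the same documents; the second simply does per-document
work that the first avoids.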

A MatchDecider is useful when the decision is more complicated - for
example you might want to restrict results to "within X miles of P" which
would be hard to do efficiently with OP_FILTER, but a MatchDecider could
take the coordinates from a value and calculate the distance from P,
saying "yes" only if that distance is less than X.
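
A minimal sketch of that decision function, in plain Python rather than as
a real MatchDecider subclass.  It assumes the value is stored as a
"latitude,longitude" string in degrees, which is just an encoding picked
for the example, and it uses the haversine great-circle formula for the
distance:

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))


def make_within_decider(p_lat, p_lon, max_miles):
    """Build a function that says "yes" only within max_miles of P."""
    def decide(value):
        # Assumed value format: "lat,lon" in decimal degrees.
        lat, lon = (float(part) for part in value.split(","))
        return distance_miles(p_lat, p_lon, lat, lon) < max_miles
    return decide


if __name__ == "__main__":
    near_london = make_within_decider(51.5074, -0.1278, 100)
    print(near_london("52.2053,0.1218"))  # Cambridge
    print(near_london("48.8566,2.3522"))  # Paris
```

In a real MatchDecider the string passed to decide() would be read from the
document's value slot; how you hook the function in depends on the bindings.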

For a date range, I'd suggest considering the scheme Omega uses - there
we generate terms for the date (e.g. D20040820), the month (e.g. M200408),
and the year (e.g. Y2004).  This allows a long date range to be
represented as a relatively small number of terms OR-ed together.  If
you just indexed the D terms, a year span would require 365 terms.
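
Here is one way the range-to-terms step could look, as a sketch in plain
Python.  This is not Omega's actual code, and the function name is invented;
it just walks the range and emits a Y term for each whole year, an M term
for each whole month, and D terms for the days left over at either end:

```python
import calendar
from datetime import date, timedelta


def date_range_terms(start, end):
    """Return D/M/Y terms covering start..end inclusive.

    The returned terms are meant to be OR-ed together to form the filter.
    """
    terms = []
    day = start
    while day <= end:
        year_end = date(day.year, 12, 31)
        last_dom = calendar.monthrange(day.year, day.month)[1]
        month_end = date(day.year, day.month, last_dom)
        if day.month == 1 and day.day == 1 and year_end <= end:
            # The whole calendar year lies inside the range.
            terms.append("Y%04d" % day.year)
            day = year_end + timedelta(days=1)
        elif day.day == 1 and month_end <= end:
            # The whole calendar month lies inside the range.
            terms.append("M%04d%02d" % (day.year, day.month))
            day = month_end + timedelta(days=1)
        else:
            # A leftover day at the start or end of the range.
            terms.append("D%04d%02d%02d" % (day.year, day.month, day.day))
            day += timedelta(days=1)
    return terms


if __name__ == "__main__":
    print(date_range_terms(date(2004, 8, 30), date(2004, 10, 2)))
```

For 30 August to 2 October 2004 this gives five terms (two D terms, M200409,
then two more D terms) rather than one D term for each of the 34 days.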

> I haven't found the documentation very clear as to which way will work, or
> which way is preferred/faster. Please point me in the right direction.

Thanks for the feedback - I'll slot the above suggestions into the
documentation in a suitable place.

> BTW, I'm developing this with the PHP Xapian bindings, please let me know if
> a certain feature won't work under this setup.

I'm not certain if MatchDecider is supported by the PHP bindings.  You
need to be able to sub-class it to use it, or supply the decision
function in some other way.  Hopefully someone with more knowledge of
the bindings than me can give a more definitive answer...

Cheers,
    Olly


