[Xapian-discuss] writing match deciders / custom handling of terms

Olly Betts olly at survex.com
Thu Nov 13 13:03:09 GMT 2008


On 11/11/2008, djcb <djcb.bulk at gmail.com> wrote:
> On Tue, 11 Nov 2008, Olly Betts wrote:
> > In this case I'd probably just generate a term for each flag at index time
> > and use QueryParser::add_boolean_prefix().
>
> Hmmm... that would indeed work if I have only a handful of flags; it
> does not seem to work though with the more general case;

It works well in the case where you want to precisely match an exact value,
or one of a small number of values.  It doesn't really matter whether there's
a handful of flags or thousands; what matters is how you want to search for
them.

If you want ranges, then using a value and ValueRangeProcessor makes more sense.

> another search
> criterion would match all mails more recent than three weeks, for which
> I'd use something like:
>
>          date:3w..
>
> and for message size, maybe:
>
>         size:3k..3M
>
> to match messages between 3Kb and 3Mb. I guess I need to do some custom
> handling there... is this what is discussed in ticket #220?
>        http://trac.xapian.org/ticket/220

Kind of.  That ticket is actually about a feature which two of the standard
ValueRangeProcessor (VRP) subclasses lack but the third has.

> Now, the 'AuthorValueRangeProcessor' looks easy enough; would something
> similar work for my date: / size: above?

Yes (though I'm not sure how an empty range end is handled currently).
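
As a rough illustration of the parsing such a processor would need for the
size: example, here is a minimal Python sketch using only the standard
library.  It does not call Xapian at all.  The names parse_size,
parse_size_range and sort_key are hypothetical, the choice of 1024 for "k" is
an assumption (the thread doesn't say), and the zero-padded key is a stand-in
for whatever encoding is actually stored in the value slot.  An empty range
end is simply reported as None here, which says nothing about how Xapian
itself treats it.

```python
import re

# Assumed multipliers; the thread does not say whether "k" means 1000 or 1024.
_SUFFIXES = {"": 1, "k": 1024, "M": 1024 ** 2}
_WIDTH = 12  # assumed fixed width for the padded keys


def parse_size(text):
    """Turn '3k' or '3M' into a byte count; return None for an empty range end."""
    if text == "":
        return None
    m = re.fullmatch(r"(\d+)([kM]?)", text)
    if m is None:
        raise ValueError("not a size: %r" % text)
    return int(m.group(1)) * _SUFFIXES[m.group(2)]


def parse_size_range(query):
    """Split 'size:3k..3M' into (begin, end) byte counts; an open end is None."""
    prefix = "size:"
    if not query.startswith(prefix) or ".." not in query:
        raise ValueError("not a size range: %r" % query)
    begin, end = query[len(prefix):].split("..", 1)
    return parse_size(begin), parse_size(end)


def sort_key(nbytes):
    """Zero-pad so that byte-wise string order matches numeric order."""
    return str(nbytes).zfill(_WIDTH)
```

The point of sort_key is that range checks on values are string comparisons,
so whatever is stored at index time has to sort as a string in the same order
as the numbers it represents.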

> > >  [2] But: there are some things that seem a bit strange though; e.g. there seems
> > >  to be no API to add the prefix to add_term, requiring me to manually
> > >  prefix the strings, which seems a bit hackish...
> >
> > Well, TermGenerator can do prefixing for you.  But it's mostly just string
> > concatenation anyway.
>
> Yes -- but that was my point, when I use add_term (I don't want to use
> the TermGenerator for known atomic strings), I have to do it by hand,
> which requires me to use some internal representation (the prefix) that
> other functions understand. I think it would be nicer to hide that
> implementation detail from the programmer.

As things stand, this string concatenation isn't really hidden anywhere.  It's
more of a standard convention than an implementation detail.
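
To make "just string concatenation" concrete, here is a minimal sketch in
plain Python with no Xapian calls.  The helper make_term and the prefix
"XFLAG" are hypothetical names picked for the example; lowercasing the value
is also just a choice made here.

```python
def make_term(prefix, value):
    """Build a prefixed term by plain concatenation of prefix and value."""
    return prefix + value.lower()


FLAG_PREFIX = "XFLAG"  # hypothetical prefix chosen for this example

# One term per flag, generated at index time.
terms = [make_term(FLAG_PREFIX, flag) for flag in ("Seen", "Replied")]
```

Each resulting string is what you would hand to add_term, and the same prefix
string is what you would register with the QueryParser as a boolean prefix.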

It perhaps should be more hidden, but that's Xapian 2.0 (or N.0) territory.

> > >  and the Xapian::Sorter
> > >  which returns a string, which is then sorted; I was expecting something
> > >  similar to std::less, or GCompareFunc in GLib
> >
> > The reason for generating the sort key rather than offering a comparator
> > is mostly down to the number of callbacks required - for a comparator
> > it's O(n.log(n)) while for generating a sort key it's O(n).
> >
> > Since n can easily be millions, this can make quite a difference.
>
> True; but also a bit misleading; the complexity of the whole sorting
> operation is O(n.log(n)) in both cases

Well, bin sort can be O(n*max_string_length), but you can't do bin sort if you
abstract out to a comparison function.  We don't use bin sort currently, but
it would be nice not to block off such possibilities.

> creating a million sortable
> string representations of some value might be quite expensive. And the
> overall performance will be dominated by the actual sorting, which can
> be faster with comparators versus std::string ==.

Actually, std::string::operator<, but I don't see why a comparator would be
faster than that, since it's just comparing a known number of bytes and will
stop when it finds two which differ.

For a complex comparison (e.g. Euclidean distance from a given point) having to
recalculate on every comparison is a definite loss.
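
The difference in callback counts is easy to observe.  The following is a
minimal Python sketch, not Xapian code: it sorts the same points by distance
from an origin, once with a key function (one callback per item) and once with
a comparator wrapped by functools.cmp_to_key (one callback per comparison).
The function name count_callbacks and the test data are made up for the
illustration.

```python
import functools
import math
import random


def count_callbacks(points, origin):
    """Sort points by distance from origin two ways and count the callbacks."""
    calls = {"key": 0, "cmp": 0}

    def distance(p):
        return math.hypot(p[0] - origin[0], p[1] - origin[1])

    def key(p):
        # Called once per item; the result is cached by the sort.
        calls["key"] += 1
        return distance(p)

    def cmp(a, b):
        # Called once per comparison; recomputes both distances every time.
        calls["cmp"] += 1
        da, db = distance(a), distance(b)
        return (da > db) - (da < db)

    by_key = sorted(points, key=key)
    by_cmp = sorted(points, key=functools.cmp_to_key(cmp))
    return calls["key"], calls["cmp"], by_key == by_cmp


random.seed(0)
points = [(random.random(), random.random()) for _ in range(1000)]
key_calls, cmp_calls, same_order = count_callbacks(points, (0.5, 0.5))
```

Both approaches produce the same order, but the key function runs exactly once
per point while the comparator runs once per comparison, and each comparator
call here evaluates the distance twice.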

Another benefit is that it avoids the issue of the user's comparator not being
a valid ordering, which is undefined behaviour and can cause a segmentation
fault (or indeed pretty much anything else).

Even ruling out deliberate abuse and outright stupidity, excess precision can
cause this when the code looks benign (as in the bug fixed in 1.0.9).
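
As an illustration of how a benign-looking comparator can be invalid, here is
a minimal Python sketch.  It is not the excess-precision bug mentioned above,
but a different and simpler case: a "less than" with a tolerance, which breaks
the requirement that equivalence be transitive.  All the names here
(fuzzy_less, equivalent, equivalence_is_transitive) are hypothetical.  Python
only lets us detect the violation; handing a comparator like this to a C++
sort is where the undefined behaviour comes in.

```python
def fuzzy_less(a, b, eps=1.0):
    """Treat values within eps of each other as equal.  Looks harmless."""
    return a < b - eps


def equivalent(less, a, b):
    """Two items are equivalent when neither orders before the other."""
    return not less(a, b) and not less(b, a)


def equivalence_is_transitive(less, items):
    """Check one requirement of a valid ordering over the given items."""
    for a in items:
        for b in items:
            for c in items:
                if (equivalent(less, a, b) and equivalent(less, b, c)
                        and not equivalent(less, a, c)):
                    return False
    return True


def plain_less(a, b):
    """Ordinary numeric comparison, for contrast."""
    return a < b
```

With the values 0.0, 0.6 and 1.2, fuzzy_less says the first two are
equivalent and the last two are equivalent, yet 0.0 orders before 1.2, so the
check fails.  Generating a sort key sidesteps this whole class of problem,
because plain string comparison is always a valid ordering.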

Cheers,
    Olly


