[Xapian-discuss] writing match deciders / custom handling of terms

djcb djcb.bulk at gmail.com
Sat Nov 29 17:57:24 GMT 2008


Hi Olly,

As always, thanks for your insightful comments. A great asset for Xapian.

On Thu, 13 Nov 2008, Olly Betts wrote:

> 
> On 11/11/2008, djcb <djcb.bulk at gmail.com> wrote:
> > On Tue, 11 Nov 2008, Olly Betts wrote:
> > > In this case I'd probably just generate a term for each flag at index time
> > > and use QueryParser::set_boolean_prefix().
> >
> > Hmmm... that would indeed work if I have only a handful of flags; it
> > does not seem to work though with the more general case;
> 
> It works well in the case where you want to precisely match an exact value,
> or one of a small number of values.  It doesn't really matter if there's a
> handful of flags or thousands, it's how you want to search for them which
> matters.
> 
> If you want ranges, then using a value and ValueRangeProcessor makes more sense.

Well, I have a bunch of flags, and I'd like to match all new messages that
are encrypted but don't have attachments:
    flags:NX^A
  
(N means 'new', X means 'encrypted', A means 'has attachments', '^' means 'not')

In my current implementation, I simply translate that into a number, and do
a bitwise-OR with a number in the database. This is quite hard with
Xapian; it's doable, but it will require quite some tricks.
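
As a rough illustration of the bitmask idea above (plain Python, not Xapian
code): the letter-to-bit mapping and the function names are hypothetical, '^'
is assumed to negate only the letter that follows it, and the test uses
bitwise AND against two masks, which is one plausible way to do the
comparison rather than necessarily the one used in the implementation
mentioned here.

```python
# Hypothetical letter-to-bit mapping; the real flag set is not given here.
FLAG_BITS = {"N": 1 << 0, "X": 1 << 1, "A": 1 << 2}


def parse_flag_spec(spec):
    """Turn a spec such as 'NX^A' into (required_mask, forbidden_mask)."""
    required = 0
    forbidden = 0
    negate = False
    for ch in spec:
        if ch == "^":
            # Assumption: '^' applies to the next flag letter only.
            negate = True
            continue
        bit = FLAG_BITS[ch]  # raises KeyError for an unknown flag letter
        if negate:
            forbidden |= bit
        else:
            required |= bit
        negate = False
    return required, forbidden


def matches(message_flags, spec):
    """True if all required bits are set and no forbidden bit is set."""
    required, forbidden = parse_flag_spec(spec)
    return (message_flags & required) == required and (message_flags & forbidden) == 0
```

With this mapping, a message whose stored number has the N and X bits set
but not the A bit matches 'NX^A'; setting the A bit as well makes it fail.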
 
> > Now, the 'AuthorValueRangeProcessor' looks easy enough; would something
> > similar work for my date: / size: above?
> 
> Yes (though I'm not sure how an empty range end is handled currently).

Well, I think I have to pre-process queries before I feed them to the
QueryParser. What would be nice is some control over how individual
elements are parsed (e.g. the flags:NX^A example above).
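
A minimal sketch of such a pre-processing step, assuming the 'flags:' syntax
from the example above. The regular expression and the function name
split_query are made up for illustration; the remaining text is what would
then be handed to the QueryParser, and the extracted specs would be handled
separately.

```python
import re

# Assumed syntax: the prefix 'flags:' followed by flag letters and '^'.
FLAGS_RE = re.compile(r"\bflags:([A-Za-z^]+)")


def split_query(query):
    """Return (flag_specs, remaining_query) for a raw query string."""
    specs = FLAGS_RE.findall(query)
    # Remove the flags: elements and normalise the leftover whitespace.
    rest = " ".join(FLAGS_RE.sub(" ", query).split())
    return specs, rest
```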

> > creating a million sortable
> > string representations of some value might be quite expensive. And the
> > overall performance will be dominated by the actual sorting, which can
> > be faster with comparators versus std::string ==.
> 
> Actually, std::string::operator<, but I don't see why a comparator would be
> faster than that since it's just comparing a known number of bytes and will
> stop when it finds two which differ.

Hmmm, comparing numbers is faster than comparing their
string representations. A 64-bit number takes 8 bytes, but 20 chars as a
decimal string (with leading '0's). If they're small numbers, we're
comparing quite some 0s first. [ But admittedly, Xapian has
'sortable_serialise' which helps here ].
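
To make the size difference concrete, here is a small stand-alone sketch
(plain Python, not Xapian's sortable_serialise, whose encoding is different).
It compares a zero-padded 20-character decimal string with an 8-byte
big-endian packing of the same unsigned 64-bit number; both sort in numeric
order when compared bytewise. The helper names padded and packed are
hypothetical.

```python
import struct


def padded(n):
    """Zero-padded decimal string, wide enough for any unsigned 64-bit value."""
    return "%020d" % n


def packed(n):
    """8-byte big-endian encoding; bytewise order equals numeric order."""
    return struct.pack(">Q", n)


values = [2**64 - 1, 7, 1000, 42, 0]

# Both encodings give the same ordering as the numbers themselves.
by_number = sorted(values)
by_padded = sorted(values, key=padded)
by_packed = sorted(values, key=packed)
assert by_number == by_padded == by_packed

print(len(padded(7)), len(packed(7)))  # 20 8
```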

> For a complex comparison (e.g. euclidean distance from a given point) having to
> calculate each time is a definite loss.

One can show examples either way of course... what about storing IPv6
addresses; they are 128-bits; if I need to allocate a std::string for
each of them that's quite expensive if I have a million; and as they
might have long equal prefixes, sorting them can be rather expensive
too. And sortable_serialise does not work there...
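
For what it's worth, a 128-bit address can be stored as a fixed 16-byte
string whose bytewise order matches the numeric order; the sketch below uses
Python's standard ipaddress module to show this. It only illustrates the
encoding (the name ipv6_sort_key and the sample addresses are made up); it
does not address the allocation or long-common-prefix costs raised above.

```python
import ipaddress


def ipv6_sort_key(text):
    """16-byte network-order form of an IPv6 address, usable as a sort key."""
    return ipaddress.IPv6Address(text).packed


addrs = ["2001:db8::10", "2001:db8::2", "::1", "2001:db8:0:1::1"]

# Sorting by the packed bytes gives the same order as sorting by integer value.
ordered = sorted(addrs, key=ipv6_sort_key)
by_int = sorted(addrs, key=lambda a: int(ipaddress.IPv6Address(a)))
assert ordered == by_int
```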
 
> Another benefit is that it avoids the issue of the user's comparator not
> being valid, which causes undefined behaviour which can cause a segmentation
> fault (or indeed pretty much anything else).

> Even ruling out deliberate abuse and outright stupidity, excess precision can
> cause this when the code looks benign (as in the bug fixed in 1.0.9).

The user might just as well cause a segfault in her function that returns
a std::string.

It would be nice to have typed values in Xapian -- not everything is
best represented as a string. Storing numbers as strings, and then
comparing them as strings just seems a bit suboptimal; the
sortable_(un)serialise treats the symptom to some extent, but the real
issue remains.

Anyway, once more, Xapian is a nice piece of software; most of these
things are totally workaroundable, and my datasets are not so big, and
performance is just excellent.

Thanks,
Dirk.

-- 
------------------------------------------------------
Dirk-Jan C. Binnema             <djcb at djcbsoftware.nl>
             http://www.djcbsoftware.nl/ 
PGP: D09C E664 897D 7D39 5047 A178 E96A C7A1 017D DA3C
------------------------------------------------------


