[Xapian-discuss] writing match deciders / custom handling of terms

Olly Betts olly at survex.com
Mon Dec 1 11:04:55 GMT 2008


On Sat, Nov 29, 2008 at 07:57:24PM +0200, djcb wrote:
> On Thu, 13 Nov 2008, Olly Betts wrote:
> > It works well in the case where you want to precisely match an exact value,
> > or one of a small number of values.  It doesn't really matter if there's a
> > handful of flags or thousands, it's how you want to search for them which
> > matters.

> Well, I have a bunch of flags, and I'd like to match all new messages that
> are encrypted but don't have attachments:
>     flags:NX^A
>   
> (X means 'encrypted', '^' means 'not')
> 
> In my current implementation, I simply translate that into a number, and do
> a bitwise-OR with a number in the database. This is quite hard with
> Xapian; it's doable, but it will require quite some tricks.

I don't actually see how you implement this with a bitwise-OR, but
anyway with Xapian I'd just index it as a set of boolean terms - e.g.
XFLAG:N XFLAG:X XFLAG:A and a document only gets the term for a flag if
the flag is set.

Then "flags:NX^A" translates to: ((XFLAG:N AND XFLAG:X) AND_NOT XFLAG:A)
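
As a rough illustration of that translation, here is a minimal sketch
using only the C++ standard library.  The names FlagSpec,
parse_flag_spec and to_boolean_expression are hypothetical and not part
of Xapian, and it assumes that '^' negates only the single flag letter
which follows it.  It builds the expression as a string; real code
would presumably build the equivalent query object instead.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper type (not part of Xapian): the flags which must be
// set and the flags which must not be set.
struct FlagSpec {
    std::vector<char> required;
    std::vector<char> excluded;
};

// Split a compact spec such as "NX^A" into required and excluded flags.
// Assumption: '^' negates only the flag letter immediately after it.
FlagSpec parse_flag_spec(const std::string& spec) {
    FlagSpec out;
    bool negate = false;
    for (char c : spec) {
        if (c == '^') {
            negate = true;
            continue;
        }
        (negate ? out.excluded : out.required).push_back(c);
        negate = false;
    }
    return out;
}

// Join flags as XFLAG:<letter> terms with the given operator, adding
// parentheses when there is more than one term.
static std::string join_terms(const std::vector<char>& flags,
                              const std::string& op) {
    std::string s;
    for (std::size_t i = 0; i < flags.size(); ++i) {
        if (i > 0) s += " " + op + " ";
        s += "XFLAG:";
        s += flags[i];
    }
    return flags.size() > 1 ? "(" + s + ")" : s;
}

// Render the spec as a boolean expression over XFLAG:<letter> terms.
std::string to_boolean_expression(const FlagSpec& spec) {
    std::string required = join_terms(spec.required, "AND");
    std::string excluded = join_terms(spec.excluded, "OR");
    if (excluded.empty()) return required;
    // A pure negation needs something to subtract from; this placeholder
    // stands for "all documents" in this sketch.
    if (required.empty()) required = "<all documents>";
    return "(" + required + " AND_NOT " + excluded + ")";
}
```

Calling to_boolean_expression(parse_flag_spec("NX^A")) gives the
expression shown above.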

> > Yes (though I'm not sure how an empty range end is handled currently).
> 
> Well, I think I have to pre-process queries before I feed them to the
> QueryParser. What would be nice is some control over how individual
> elements are parsed (eg. the flags:NX^A example above).

We discourage preprocessing input to QueryParser, but currently you'll
probably have to do so to support that "flags:" syntax.  You can easily
support this syntax instead, though:

flag:N flag:X -flag:A

And that would allow the user to write arbitrary boolean expressions
such as:

(flag:N OR (flag:X NOT flag:A)) XOR flag:Q

And yes, there should be more control possible over parsing.  There is
some discussion of ideas for this here: http://trac.xapian.org/ticket/128

> > For a complex comparison (e.g. euclidean distance from a given point) having to
> > calculate each time is a definite loss.
> 
> One can show examples either way of course... what about storing IPv6
> addresses; they are 128-bits; if I need to allocate a std::string for
> each of them that's quite expensive if I have a million; and as they
> might have long equal prefixes, sorting them can be rather expensive
> too. And sortable_serialise does not work there...

But is sorting by IPv6 address ever useful?

Note that a decent STL implementation will pool memory used by destroyed
strings and reuse it, so this shouldn't actually make a million memory
allocations.
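
For what it's worth, if you did want to sort by IPv6 address with
string keys, one possible approach (a sketch under stated assumptions,
not something Xapian provides) is to use the 16 raw address bytes in
network byte order as the key.  std::string compares characters as
unsigned bytes, so ordinary string comparison then orders the keys
numerically.  The names Ipv6Bytes and ipv6_sort_key are hypothetical.

```cpp
#include <array>
#include <cstdint>
#include <string>

// Hypothetical representation: the 16 address bytes, most significant
// byte first (network byte order).
using Ipv6Bytes = std::array<std::uint8_t, 16>;

// Build a fixed-length 16 byte key.  The explicit length is needed
// because addresses routinely contain zero bytes.
std::string ipv6_sort_key(const Ipv6Bytes& addr) {
    return std::string(reinterpret_cast<const char*>(addr.data()),
                       addr.size());
}
```

Because every key has the same length, comparing two keys never has to
look further than the first differing byte, even for addresses which
share a long prefix.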

Anyway, rather than arguing about whether hypothetical examples might be
better or not, it would be more productive to prototype how you think
this should work and then show some realistic benchmarks demonstrating
that the current approach is actually slower.

> > Another benefit is that it avoids the issue of the user's comparator not being
> > valid, which causes undefined behaviour which can cause a segmentation
> > fault (or indeed pretty much anything else).
> 
> > Even ruling out deliberate abuse and outright stupidity, excess precision can
> > cause this when the code looks benign (as in the bug fixed in 1.0.9).
> 
> The user might just as well cause a segfault in her function that returns
> a std::string.

Not if it's written in Python...

But I disagree that this is a fair comparison to make.  Writing code
which directly segfaults is really rather different to writing code with
a subtle logic error, such as this:

    bool operator<(double a, double b) {
        return foo(a) < foo(b);
    }
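
To make "not valid" concrete, here is a small sketch which brute-force
checks the strict weak ordering rules over a set of sample values.  It
is only an illustration: the foo() here is a hypothetical stand-in
(std::sqrt), and the failure it shows comes from NaN results rather
than from excess precision, which is much harder to reproduce portably.
The names by_foo and is_strict_weak_on are made up for this example.

```cpp
#include <cmath>
#include <vector>

// Hypothetical stand-in for foo(); yields NaN for negative inputs.
double foo(double x) { return std::sqrt(x); }

// A comparator with the same shape as the one above.
bool by_foo(double a, double b) { return foo(a) < foo(b); }

// Check the strict weak ordering requirements for every combination of
// the sample values.  Passing says nothing about values not sampled.
template <typename Compare>
bool is_strict_weak_on(const std::vector<double>& xs, Compare comp) {
    // Two values are "equivalent" if neither orders before the other.
    auto equiv = [&](double a, double b) {
        return !comp(a, b) && !comp(b, a);
    };
    for (double a : xs) {
        if (comp(a, a)) return false;                    // irreflexivity
        for (double b : xs) {
            if (comp(a, b) && comp(b, a)) return false;  // asymmetry
            for (double c : xs) {
                // transitivity of the ordering
                if (comp(a, b) && comp(b, c) && !comp(a, c)) return false;
                // transitivity of equivalence
                if (equiv(a, b) && equiv(b, c) && !equiv(a, c)) return false;
            }
        }
    }
    return true;
}
```

With samples 1, 4 and 9 the check passes, but adding -1 makes it fail:
foo(-1) is NaN, which compares as equivalent to both 1 and 4 even
though 1 and 4 are not equivalent to each other.  Sorting with such a
comparator is undefined behaviour, whereas comparing serialised string
keys is always a valid ordering.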

> It would be nice to have typed values in Xapian -- not everything is
> best represented as a string. Storing numbers as strings, and then
> comparing them as strings just seems a bit suboptimal; the
> sortable_(un)serialise treats the symptom to some extent, but the real
> issue remains.
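
For readers unfamiliar with the idea behind comparing numbers as
strings, here is a sketch of one well-known order-preserving encoding
of a double.  This is NOT the format which sortable_serialise actually
uses; it is a simpler fixed-width scheme shown only to illustrate the
principle, it assumes 64-bit IEEE 754 doubles, and it ignores NaN.  The
name order_preserving_key is hypothetical.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Encode a double as 8 bytes such that byte-wise string comparison of
// two keys agrees with numeric comparison of the original values.
std::string order_preserving_key(double value) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch assumes 64-bit doubles");
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);

    const std::uint64_t sign = std::uint64_t(1) << 63;
    // Negative values: invert every bit, so a larger magnitude sorts
    // earlier.  Non-negative values: set the top bit, so they sort
    // after all negative values.
    bits = (bits & sign) ? ~bits : (bits | sign);

    // Emit most significant byte first so string order matches.
    std::string key(8, '\0');
    for (int i = 0; i < 8; ++i)
        key[i] = static_cast<char>((bits >> (56 - 8 * i)) & 0xff);
    return key;
}
```

The encoding is done once per value, and after that the comparison is
a plain byte-wise string compare.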

Xapian isn't intended to be a general purpose database system.  If you
want to mix Xapian and SQL-like data handling, then the PostingSource
class in SVN trunk seems an easier approach than trying to recreate an
SQL-like environment inside Xapian.

My feeling is that adding this would require a lot of work, and increase 
complexity for a disproportionately small actual benefit.  And I've
already got plenty of things in Xapian that I want to work on!

But if you really think this makes sense, feel free to sketch out a
design, convince us that it is the way to go, and get coding.

Cheers,
    Olly


