[Xapian-discuss] getting involved

Thu Jan 23 11:23:18 GMT 2014

On Wed, Jan 22, 2014 at 03:30:52PM +0100, Előd Biszak wrote:
> I am interested in the following tasks:
> 
> 1. I want to add support for non-term sub-queries in positional
> queries. I think it would make sense for positional sub-queries (e.g.
> "Thomas Jefferson" NEAR "King George").

Not only positional sub-queries - e.g. these have natural
interpretations too:

    A NEAR (B OR C)  equivalent to  (A NEAR B) OR (A NEAR C)

    (X AND Y) NEAR Z  equivalent to  (X NEAR Z) AND (Y NEAR Z)
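
The two equivalences above can be sketched as a rewrite over a toy
query tree.  This is only an illustration in Python: Term, Op and
expand_near are hypothetical names, not Xapian classes, and it assumes
the same distribution rule also holds for a NEAR with more than two
sub-queries.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Term:
    name: str


@dataclass(frozen=True)
class Op:
    kind: str                   # "AND", "OR" or "NEAR"
    args: Tuple["Node", ...]


Node = Union[Term, Op]


def expand_near(node):
    """Push NEAR down through AND/OR until NEAR has no AND/OR children."""
    if isinstance(node, Term):
        return node
    args = tuple(expand_near(a) for a in node.args)
    if node.kind == "NEAR":
        for i, a in enumerate(args):
            if isinstance(a, Op) and a.kind in ("AND", "OR"):
                # NEAR(.., (x OP y), ..) -> OP(NEAR(.., x, ..), NEAR(.., y, ..))
                branches = tuple(
                    expand_near(Op("NEAR", args[:i] + (sub,) + args[i + 1:]))
                    for sub in a.args
                )
                return Op(a.kind, branches)
    return Op(node.kind, args)


def show(node):
    """Render a tree as a fully parenthesised string."""
    if isinstance(node, Term):
        return node.name
    return "(" + f" {node.kind} ".join(show(a) for a in node.args) + ")"


A, B, C = Term("A"), Term("B"), Term("C")
print(show(expand_near(Op("NEAR", (A, Op("OR", (B, C)))))))
```

A NEAR whose sub-query is itself positional (such as a phrase) is left
untouched by this sketch; that case needs real position handling
rather than a rewrite.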

In 1.2.x, we handle simple cases like these by expanding them as
shown, but that no longer happens on trunk - the internals of Query
objects were reimplemented, and making this work again is the remaining
piece of that work.  And at least for the OR case it seems better to
handle it with an OrPositionList class rather than by expansion.
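
As a rough guess at what an OrPositionList could do (the class is only
named here, not specified): present the positions of B OR C as a single
sorted stream, so that NEAR can treat the OR like one term.  The
function names and the "within window positions" test below are
assumptions made for this sketch, not Xapian's actual API or exact
NEAR semantics.

```python
import heapq


def or_positions(*position_lists):
    """Yield the sorted union of several sorted position lists."""
    last = None
    for pos in heapq.merge(*position_lists):
        if pos != last:         # skip positions shared by several lists
            yield pos
            last = pos


def near(left, right, window):
    """True if some position in left is within `window` of one in right."""
    left, right = list(left), list(right)
    i = j = 0
    while i < len(left) and j < len(right):
        if abs(left[i] - right[j]) <= window:
            return True
        # Advance whichever side is behind; both lists are sorted.
        if left[i] < right[j]:
            i += 1
        else:
            j += 1
    return False


a, b, c = [3, 40], [10, 90], [42]
print(near(a, or_positions(b, c), 2))   # A NEAR (B OR C)
```

This avoids running the A side of the match once per OR branch, which
is what the expanded form has to do.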

> 2. Possible improvement in positional queries. In the docs it says that
> "Queries which use positional information can be significantly slower to
> process [...] This will be improved in the future". Are there any thoughts
> on how to improve them?

There have been some substantial improvements recently.  1.2.14 added
an optimisation to check weight before positional conditions, which
helps a lot.  There are more major changes on trunk - the decoding of
positional data is now done lazily, and the position table key order has
changed to improve locality of access at search time.
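
To illustrate the first two ideas (weight check first, lazy decoding),
here is a small Python sketch.  It is not how Xapian implements them:
LazyPositions, accept and the one-byte delta encoding are all invented
for the example.

```python
class LazyPositions:
    """Holds encoded positions and decodes them only on first use."""

    def __init__(self, encoded):
        self._encoded = encoded
        self._decoded = None
        self.decode_count = 0   # only here to make the laziness visible

    def get(self):
        if self._decoded is None:
            self.decode_count += 1
            # Hypothetical encoding: one byte per position, delta-coded.
            total, out = 0, []
            for delta in self._encoded:
                total += delta
                out.append(total)
            self._decoded = out
        return self._decoded


def is_phrase(pos_a, pos_b):
    """True if some position in pos_b directly follows one in pos_a."""
    following = set(pos_b)
    return any(p + 1 in following for p in pos_a)


def accept(weight, min_weight, pos_a, pos_b):
    # Cheap test first: a document which can't reach the current
    # threshold is rejected without decoding any positional data.
    if weight < min_weight:
        return False
    return is_phrase(pos_a.get(), pos_b.get())


a = LazyPositions(bytes([3, 4]))    # decodes to positions 3 and 7
b = LazyPositions(bytes([8]))       # decodes to position 8
print(accept(1.0, 5.0, a, b), a.decode_count)
```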

There's likely still scope for further improvement though.

There's a patch in ticket #394 which is promising:

http://trac.xapian.org/ticket/394

This originally made a huge difference to the worst cases, but the
mechanism is hooked in rather crudely, which made us uncomfortable with
merging it.  Since then, the weight optimisation which was added in
1.2.14 has reduced the impact this patch would have - the timings in the
most recent comment show a 10% improvement, but that was before the most
recent changes on trunk.  Also, the current default pond size was an
arbitrary choice and nobody has tried tuning it and seeing what
difference that makes.

> 3. Adding support for efficient ranking based on positional information.
> 
> What do you guys think, are these possible improvements? I have the time
> and the motivation.

The trick for (3) is to have a model which bounds the contribution which
the positional information can make to the weight.  You can then feed
that bound into Xapian's weighting model, which will help to eliminate
many documents without having to actually look at their positional data
at all.

E.g. say you're looking for 10 results for the query:

  hello world

If you know that the weight bonus for the two words appearing together
is <= 6 (for example), and you have already found 10 results, and the
lowest scoring of these has a total weight of 50, then any document
which matches hello AND world but scores < 44 before the bonus can't
score enough to make it into the final top 10, since even the full
bonus would leave it below 50.
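
The same arithmetic as a small Python sketch, using a top 2 rather than
a top 10 to keep it short.  Here top_k, bonus_for and the weights are
made up for illustration, and bonus_for stands in for whatever
(expensive) positional scoring is used; the only property relied on is
that it never returns more than max_bonus.

```python
import heapq


def top_k(candidates, k, max_bonus, bonus_for):
    """Return (results, checked).

    candidates: iterable of (doc_id, base_weight) pairs.
    bonus_for(doc_id): positional bonus, assumed to be <= max_bonus.
    checked: the documents whose positional data was actually examined.
    """
    heap = []       # min-heap of (total_weight, doc_id); heap[0] is lowest
    checked = []
    for doc_id, base in candidates:
        # Even with the full bonus this document can't reach the lowest
        # weight currently in the top k, so skip the positional work.
        if len(heap) == k and base + max_bonus < heap[0][0]:
            continue
        checked.append(doc_id)
        total = base + bonus_for(doc_id)
        if len(heap) < k:
            heapq.heappush(heap, (total, doc_id))
        elif total > heap[0][0]:
            heapq.heapreplace(heap, (total, doc_id))
    return sorted(heap, reverse=True), checked


bonus = {"d1": 0, "d2": 0, "d3": 6, "d4": 6}
docs = [("d1", 50), ("d2", 52), ("d3", 43), ("d4", 45)]
results, checked = top_k(docs, 2, 6, bonus.get)
print(results, checked)
```

With a lowest weight of 50 in the results and a bound of 6, d3 (base
weight 43) is dropped without its bonus ever being computed, while d4
(base weight 45) still has to be checked.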

Cheers,
    Olly