NEAR non-leaf subqueries

Olly Betts olly at survex.com
Fri Jan 20 03:37:41 GMT 2017


On Thu, Jan 12, 2017 at 07:53:21PM +0100, Jean-Francois Dockes wrote:
> Olly Betts writes:
>  > On Wed, Jan 04, 2017 at 07:29:58AM +0100, Jean-Francois Dockes wrote:
>  > > I'd rather go back to 1.2 than used a patched 1.4 by the way.
>  > 
>  > Once we have a working patch, it should be mergable into 1.4.x (I can't
>  > see why any ABI changes would be needed) so using a patched 1.4
>  > shouldn't be an issue.
> 
> My phrase was unclear: explanation: I could use a patched 1.4 on Windows
> where libxapian is bundled with recoll, but I was thinking ahead to a
> situation where I'd have a 1.2/1.4 choice on Linux, where bundling a
> patched 1.4 would not be acceptable. In the latter case, I'd rather use 1.2
> because of the NEAR issue.

I follow, but my point was that once we have a patch we're confident
works we can merge it in for the next 1.4.x release - so you shouldn't
need to consider this as an option anyway.

> Recoll also supports multi-word synonyms which could potentially generate
> PHRASE subqueries inside NEAR queries, but this understandably already did
> not work with 1.2, so the multi-word expansions are only used when proximity
> is not involved (by the way, proximity of phrases does make sense in this
> case, if there is a wishlist somewhere, but it's admittedly not an issue
> that most users will be concerned with...).

Another case for https://trac.xapian.org/ticket/508 I think.

>  >  * Currently the OP_OR subqueries can only have two subqueries of their own.
>  >    Lifting this restriction needs a bit of work on the new
>  >    OrPositionList class 
>  >    - the old patch used a series of pairwise OrPositionList objects, but the
>  >    new patch needs a single one instead - the class needs reworking to handle
>  >    that. 
>  > 
>  > So I think the second limitation needs addressing, and of course any bugs
>  > resolving.
> 
> I am not sure that I completely understand this paragraph, but, anyway,
> although I have a bit of trouble reading my own code, I think that recoll
> will only add flat OP_OR queries as subqueries of the NEAR one. I tested
> the patch and it does seem to answer my selfish needs...

The code I pushed before wouldn't handle an OR of more than two things,
so you couldn't do a 3+-way stem expansion:

    (text OR texts) NEAR (search OR searches OR searched OR searching)

But I've just pushed an update which will handle this.

>  > I can't promise anything re schedule, but hopefully we can sort this out
>  > fairly soon.  At least the solution for what's missing now is fairly clear -
>  > we probably want to put the sub-positionlists into a min heap.
> 
> See, you lost me with the last phrase, and that's why it's better that I
> don't get into Xapian-core internals :)

A heap is a datastructure which is good for merging ordered lists, and
a min-heap just means that the tip of the heap is the smallest entry
(a max-heap is probably more common).

But I tried a heap and having looked at how things work in practice I
concluded the heap really only benefits advancing to the next position,
whereas the common operation is skipping to at least position N.  In
practice the cost of maintaining the heap cancels out the savings, so
I've pushed a simpler approach to the branch:

https://github.com/ojwb/xapian/tree/orpositionlist

Can you give that some real-world testing?

Cheers,
    Olly



More information about the Xapian-discuss mailing list