NEAR non-leaf subqueries

Jean-Francois Dockes jf at dockes.org
Fri Jan 20 14:35:13 GMT 2017


Olly Betts writes:
 > On Thu, Jan 12, 2017 at 07:53:21PM +0100, Jean-Francois Dockes wrote:
 > 
 > > Recoll also supports multi-word synonyms which could potentially
 > > generate PHRASE subqueries inside NEAR queries, but this
 > > understandably already did not work with 1.2, so the multi-word
 > > expansions are only used when proximity is not involved (by the way,
 > > proximity of phrases does make sense in this case, if there is a
 > > wishlist somewhere, but it's admittedly not an issue that most users
 > > will be concerned with...).
 > 
 > Another case for https://trac.xapian.org/ticket/508 I think.

The ticket only lists OP_OR as subqueries, but this case would be a bit
different, because one of the OR subqueries would actually be a phrase:

filesystem NEAR otherterm

may be transformed by synonym expansion into:

(filesystem OR (file PHRASE system)) NEAR otherterm
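
To make the shape of that expanded query concrete, here is a minimal sketch in Python. It is not Recoll or Xapian code: the synonym table and the helpers expand(), near() and show() are hypothetical names, invented for this illustration. A query is modelled as a nested tuple whose first element is the operator.

```python
# Hypothetical sketch: none of these names come from Recoll or Xapian.
SYNONYMS = {"filesystem": ["file system"]}  # assumed synonym table


def expand(term):
    """Return a query tree for term, OR-ing in any synonyms it has."""
    alternatives = [term]
    for synonym in SYNONYMS.get(term, []):
        words = synonym.split()
        # A multi-word synonym becomes a PHRASE subquery.
        alternatives.append(("PHRASE", *words) if len(words) > 1 else synonym)
    return ("OR", *alternatives) if len(alternatives) > 1 else term


def near(*terms):
    """Build a NEAR query whose children are the expanded terms."""
    return ("NEAR", *(expand(t) for t in terms))


def show(node):
    """Render a query tree in the infix notation used in this thread."""
    if isinstance(node, str):
        return node
    op, *children = node
    return "(" + f" {op} ".join(show(c) for c in children) + ")"


print(show(near("filesystem", "otherterm")))
# prints: ((filesystem OR (file PHRASE system)) NEAR otherterm)
```

The point is only that the NEAR operator ends up with a non-leaf child (the OR) which itself has a non-leaf child (the PHRASE).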


 > >  >  * Currently the OP_OR subqueries can only have two subqueries of
 > >  >    their own.  Lifting this restriction needs a bit of work on the
 > >  >    new OrPositionList class - the old patch used a series of
 > >  >    pairwise OrPositionList objects, but the new patch needs a
 > >  >    single one instead - the class needs reworking to handle that.
 > >  > 
 > >  > So I think the second limitation needs addressing, and of course
 > >  > any bugs resolving.
 > > 
 > > I am not sure that I completely understand this paragraph, but, anyway,
 > > although I have a bit of trouble reading my own code, I think that recoll
 > > will only add flat OP_OR queries as subqueries of the NEAR one. I tested
 > > the patch and it does seem to answer my selfish needs...
 > 
 > The code I pushed before wouldn't handle an OR of more than two things,
 > so you couldn't do a 3+-way stem expansion:
 > 
 >     (text OR texts) NEAR (search OR searches OR searched OR searching)
 > 
 > But I've just pushed an update which will handle this.


Ok, I hadn't even noticed the limitation. Did it silently truncate the
OR list? I did not have a formal test case for this; I just saw that the
error message was gone and that the results appeared reasonable.


 > >  > I can't promise anything re schedule, but hopefully we can sort
 > >  > this out fairly soon.  At least the solution for what's missing now
 > >  > is fairly clear - we probably want to put the sub-positionlists
 > >  > into a min heap.
 > > 
 > > See, you lost me with the last phrase, and that's why it's better that I
 > > don't get into Xapian-core internals :)
 > 
 > A heap is a datastructure which is good for merging ordered lists, and
 > a min-heap just means that the tip of the heap is the smallest entry
 > (a max-heap is probably more common).

Thanks for the explanation.
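
For readers who want to see the idea, a min-heap merge of sorted position lists can be sketched as follows. This is plain Python using the standard heapq module, not the Xapian implementation; merge_positions() is a hypothetical name, and positions are assumed to be plain integers in ascending order.

```python
import heapq


def merge_positions(position_lists):
    """Yield the union of several sorted position lists in ascending order."""
    heap = []
    for index, positions in enumerate(position_lists):
        iterator = iter(positions)
        first = next(iterator, None)
        if first is not None:
            # The unique index breaks ties, so iterators are never compared.
            heap.append((first, index, iterator))
    heapq.heapify(heap)

    last = None
    while heap:
        # The tip of a min-heap is always the smallest pending position.
        position, index, iterator = heap[0]
        if position != last:
            yield position
            last = position
        following = next(iterator, None)
        if following is None:
            heapq.heappop(heap)  # this sub-list is exhausted
        else:
            heapq.heapreplace(heap, (following, index, iterator))


print(list(merge_positions([[1, 5, 9], [2, 5], [3, 10]])))
# prints: [1, 2, 3, 5, 9, 10]
```

Each step to the next position costs O(log k) for k sub-lists, which is the "advancing to the next position" case the heap is good at.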

 > But I tried a heap and having looked at how things work in practice I
 > concluded the heap really only benefits advancing to the next position,
 > whereas the common operation is skipping to at least position N.  In
 > practice the cost of maintaining the heap cancels out the savings, so
 > I've pushed a simpler approach to the branch:
 > 
 > https://github.com/ojwb/xapian/tree/orpositionlist
 > 
 > Can you give that some real-world testing?
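
As an illustration of why a heap may not pay off for skipping, here is one possible heap-free approach, sketched in Python. It is an assumption about what "simpler" could look like, not a description of the code in the branch: OrPositions and skip_to() are hypothetical names, and each sub-list is assumed to be a sorted list of integer positions.

```python
from bisect import bisect_left


class OrPositions:
    """Union of sorted position lists, without a heap (illustrative only)."""

    def __init__(self, position_lists):
        self.lists = [list(p) for p in position_lists]
        self.cursors = [0] * len(self.lists)

    def skip_to(self, target):
        """Return the smallest position >= target in any sub-list, or None."""
        best = None
        for i, positions in enumerate(self.lists):
            # Move this sub-list's cursor forward to its first entry >= target.
            self.cursors[i] = bisect_left(positions, target, self.cursors[i])
            if self.cursors[i] < len(positions):
                candidate = positions[self.cursors[i]]
                if best is None or candidate < best:
                    best = candidate
        return best


positions = OrPositions([[1, 5, 9], [2, 5], [3, 10]])
print(positions.skip_to(4))   # prints: 5
print(positions.skip_to(6))   # prints: 9
print(positions.skip_to(11))  # prints: None
```

A skip generally moves every sub-list anyway, so a heap would have to be rebuilt after each call; a linear scan over the sub-lists does the same work with less bookkeeping.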

It's not real real-world, but I built a contrived set of files on pairs of
stemmed words (all 2-word combinations of floor/floors/floored/flooring),
and the last commit of the branch works fine, finding all docs for
recollish "floor floor"p, which yields a Xapian request of:

((floors OR flooring OR floored OR floor) NEAR 12
 (floors OR flooring OR floored OR floor))
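
A test set of this kind can be generated along the following lines. This is a sketch, not the script actually used: it assumes "all 2-word combinations" means the 16 ordered pairs (repeats included), the file names are invented, and near_match() is a hypothetical, simplified stand-in for NEAR that only checks that two distinct word positions are less than the window apart.

```python
from itertools import product

FORMS = ["floor", "floors", "floored", "flooring"]


def make_corpus():
    """One tiny document per ordered pair of word forms (16 in total)."""
    return {f"{a}-{b}.txt": f"{a} {b}" for a, b in product(FORMS, repeat=2)}


def near_match(text, group_a, group_b, window):
    """True if a word from group_a and a word from group_b occur at
    distinct positions less than `window` apart (simplified NEAR)."""
    words = text.split()
    pos_a = [i for i, w in enumerate(words) if w in group_a]
    pos_b = [i for i, w in enumerate(words) if w in group_b]
    return any(a != b and abs(a - b) < window for a in pos_a for b in pos_b)


corpus = make_corpus()
matches = [name for name, text in corpus.items()
           if near_match(text, FORMS, FORMS, 12)]
print(len(corpus), len(matches))
# prints: 16 16
```

With every form present in both OR groups, each document should match, which is what "finding all docs" means here.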

But, actually, so does the previous version (commit 389dfb319a66), which
explains why I had not understood what the limitation was.

Both versions also work fine with "floor floor floor"p:

(floors OR flooring OR floored OR floor) NEAR 13
(floors OR flooring OR floored OR floor) NEAR 13
(floors OR flooring OR floored OR floor)

So: me happy but confused...

Cheers,

jf


