NEAR non-leaf subqueries
Jean-Francois Dockes
jf at dockes.org
Fri Jan 20 14:35:13 GMT 2017
Olly Betts writes:
> On Thu, Jan 12, 2017 at 07:53:21PM +0100, Jean-Francois Dockes wrote:
>
> > Recoll also supports multi-word synonyms which could potentially
> > generate PHRASE subqueries inside NEAR queries, but this
> > understandably already did not work with 1.2, so the multi-word
> > expansions are only used when proximity is not involved (by the way,
> > proximity of phrases does make sense in this case, if there is a
> > wishlist somewhere, but it's admittedly not an issue that most users
> > will be concerned with...).
>
> Another case for https://trac.xapian.org/ticket/508 I think.
The ticket only lists OP_OR as subqueries, but this case would be a bit
different, because one of the OR subqueries would actually be a phrase:

    filesystem NEAR otherterm

may be transformed by synonym expansion into:

    (filesystem OR (file PHRASE system)) NEAR otherterm

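To make the shape of that expanded query concrete, here is a minimal Python sketch that models it on a single toy document. It is not Xapian code: the helper names (phrase_positions, or_positions, near), the sample positions and the simplified window test are all assumptions made for illustration, and Xapian's actual NEAR semantics may differ in detail.

```python
def phrase_positions(first, second):
    """Positions where a term from `first` is immediately followed by `second`."""
    following = set(second)
    return sorted(p for p in first if p + 1 in following)


def or_positions(*lists):
    """Union of several position lists, sorted and deduplicated."""
    return sorted(set().union(*lists))


def near(left, right, window):
    """True if some position in `left` lies within `window` of one in `right`."""
    return any(abs(a - b) <= window for a in left for b in right)


# Toy document (hypothetical): "the file system driver uses otherterm"
positions = {"file": [1], "system": [2], "otherterm": [5], "filesystem": []}

# (filesystem OR (file PHRASE system))
expanded = or_positions(
    positions["filesystem"],
    phrase_positions(positions["file"], positions["system"]),
)

# ... NEAR otherterm, with an arbitrary window of 10
print(near(expanded, positions["otherterm"], window=10))
```

The point of the sketch is only that the left-hand side of the NEAR is no longer a single term's position list: it has to be computed from an OR whose branches can themselves be positional subqueries.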
> > > * Currently the OP_OR subqueries can only have two subqueries of
> > >   their own. Lifting this restriction needs a bit of work on the
> > >   new OrPositionList class - the old patch used a series of
> > >   pairwise OrPositionList objects, but the new patch needs a
> > >   single one instead - the class needs reworking to handle that.
> > >
> > > So I think the second limitation needs addressing, and of course
> > > any bugs resolving.
> >
> > I am not sure that I completely understand this paragraph, but, anyway,
> > although I have a bit of trouble reading my own code, I think that recoll
> > will only add flat OP_OR queries as subqueries of the NEAR one. I tested
> > the patch and it does seem to answer my selfish needs...
>
> The code I pushed before wouldn't handle an OR of more than two things,
> so you couldn't do a 3+-way stem expansion:
>
> (text OR texts) NEAR (search OR searches OR searched OR searching)
>
> But I've just pushed an update which will handle this.
Ok, I hadn't even noticed the limitation. Did it silently truncate the
OR list? I did not have a formal test case for this, I just saw that the
error message was gone, and that the results appeared reasonable.
> > > I can't promise anything re schedule, but hopefully we can sort
> > > this out fairly soon. At least the solution for what's missing now
> > > is fairly clear - we probably want to put the sub-positionlists
> > > into a min heap.
> >
> > See, you lost me with the last phrase, and that's why it's better that I
> > don't get into Xapian-core internals :)
>
> A heap is a datastructure which is good for merging ordered lists, and
> a min-heap just means that the tip of the heap is the smallest entry
> (a max-heap is probably more common).
Thanks for the explanation.
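
As an illustration of that explanation, here is a small Python sketch which merges several sorted position lists through a min-heap, using the standard heapq module. The function name merge_positions and the sample lists are hypothetical; this is not how Xapian implements it, only the general technique.

```python
import heapq


def merge_positions(lists):
    """Yield positions from several sorted lists in ascending order, deduplicated."""
    # Each heap entry is (position, list index, index within that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    last = None
    while heap:
        # The tip of a min-heap is always the smallest remaining position.
        pos, i, j = heapq.heappop(heap)
        if pos != last:
            yield pos
            last = pos
        # Refill the heap with the next entry from the list we just consumed.
        if j + 1 < len(lists[i]):
            heapq.heappush(heap, (lists[i][j + 1], i, j + 1))


print(list(merge_positions([[3, 9, 20], [1, 9, 15], [7]])))
```

Every step to the next position costs a pop and usually a push, which is the maintenance overhead that matters when the heap is not actually saving work.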
> But I tried a heap and having looked at how things work in practice I
> concluded the heap really only benefits advancing to the next position,
> whereas the common operation is skipping to at least position N. In
> practice the cost of maintaining the heap cancels out the savings, so
> I've pushed a simpler approach to the branch:
>
> https://github.com/ojwb/xapian/tree/orpositionlist
>
> Can you give that some real-world testing?
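
The trade-off described above (a heap helps "next", but the common operation is "skip to at least position N") can be sketched in Python as follows. This is only a guess at what a heap-free approach could look like, not the code on the branch: the class name OrPositions and its skip_to method are assumptions made for illustration.

```python
from bisect import bisect_left


class OrPositions:
    """Hypothetical union of sorted position lists, kept without a heap."""

    def __init__(self, lists):
        self.lists = lists
        # One forward-only cursor per sub-list.
        self.cursors = [0] * len(lists)

    def skip_to(self, n):
        """Return the smallest position >= n across all sub-lists, or None."""
        best = None
        for i, lst in enumerate(self.lists):
            # Advance this sub-list's cursor to its first position >= n.
            self.cursors[i] = bisect_left(lst, n, self.cursors[i])
            if self.cursors[i] < len(lst):
                pos = lst[self.cursors[i]]
                if best is None or pos < best:
                    best = pos
        return best


ors = OrPositions([[3, 9, 20], [1, 9, 15], [7]])
print(ors.skip_to(8), ors.skip_to(10), ors.skip_to(21))
```

Each skip touches every sub-list once and then takes the minimum, so there is no heap structure to keep consistent between calls. Calls are assumed to use non-decreasing values of n, since the cursors never move backwards.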
It's not real real-world, but I built a contrived set of files on pairs of
stemmed words (all 2-word combinations of floor/floors/floored/flooring),
and the last commit of the branch works fine, finding all docs for
recollish "floor floor"p, which yields a Xapian request of:

    ((floors OR flooring OR floored OR floor) NEAR 12
     (floors OR flooring OR floored OR floor))

But, actually, so does the previous version (commit 389dfb319a66), which
explains why I had not understood what the limitation was.
Both versions also work fine with "floor floor floor"p:

    (floors OR flooring OR floored OR floor) NEAR 13
    (floors OR flooring OR floored OR floor) NEAR 13
    (floors OR flooring OR floored OR floor)

So: me happy but confused...
Cheers,
jf