queries for a set of values

Olly Betts olly at survex.com
Sat Feb 22 04:30:50 GMT 2025


On Sat, Feb 22, 2025 at 12:19:38AM +0000, Eric Wong wrote:
> Olly Betts <olly at survex.com> wrote:
> > On Fri, Apr 26, 2024 at 10:37:37PM +0000, Eric Wong wrote:
> > > Say I have a bunch of values which I want to filter a query against.
> > > If I had boolean terms, it could just OP_OR against the whole set.
> > > IOW, this is what notmuch does with terms:
> > > 
> > > 	std::set<std::string> terms;
> > > 
> > > 	// notmuch populates terms via terms.insert(*i)...
> > > 
> > > 	Query(OP_OR, terms.begin(), terms.end());
> > 
> > The slicker way to do this (unless you need the std::set for other
> > reasons) would be:
> > 
> >     Xapian::Query filter = Xapian::Query::MatchAll;
> 
> (resurrecting topic from last year)
> 
> If I'm OR-ing, shouldn't that start as Xapian::Query::MatchNothing?

Oops, yes.

> >     while (more_terms()) {
> >         filter |= Xapian::Query(get_next_term());
> >     }
> > 
> > Assuming you're using Xapian >= 1.4.10 then |= on an OP_OR Query with
> > refcount 1 (as here) is specially optimised and just appends a new
> > subquery so you get a single OP_OR node and this is particularly
> > efficient (if the refcount is higher it'll build a tree, but still get
> > optimised the same way - it's just a bit less efficient because it needs
> > to allocate for each node in the tree).
> > 
> > One difference is that filter here will match everything if there are
> > no filter terms, so you can just always apply it:
> > 
> >     query = Xapian::Query(OP_FILTER, query, filter);

So this part is also wrong.

> > You can build it up the same way with:
> > 
> >     filter |= Query(OP_VALUE_RANGE, column, v, v);
> 
> OK, since I don't want to break support for Xapian <=1.4.9 users,
> this seems to work:

Using |= still works for any 1.4.x (back to dev release 1.3.2 actually),
but for <= 1.4.9 you don't get the optimisation that it appends in place
to a query with refcount 1.

So with >= 1.4.10 you get a single N-way OR node internally and with <=
1.4.9 you get a tree of depth N-1 which is a cascade of 2-way OR nodes.
Both get turned into the same thing by the query optimiser but the
latter has to allocate N-2 more internal OR nodes, and iterating over
the tree is a bit more work.

The string returned by Query::get_description() should actually reflect
this internal difference.

> for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); i++)  {
> 	Xapian::Document doc = i.get_document();
> 	std::string val = doc.get_value(column);
> 	*xqry = Xapian::Query(Xapian::Query::OP_OR, *xqry,
> 			Xapian::Query(
> 				Xapian::Query::OP_VALUE_RANGE,
> 				column, val, val));
> }

This will always create the depth N-1 tree of 2-way OR nodes, so it's a
bit less efficient for >= 1.4.10 and equivalent for older 1.4.x.

> > > It seems what I'm really looking for is an OP_VALUE_OR or OP_VALUE_IN;
> > > but only OP_VALUE_{GE,LE,RANGE} exists.
> > 
> > Just use OP_VALUE_RANGE with equal bounds.
> > 
> > Another approach is to use a custom PostingSource which can fetch the
> > value for that slot for each document being considered and check if it's
> > one of the values you want.
> 
> Noted.  I think I'll stick to (what appears to be) working code
> for now unless a custom PostingSource can be more efficient.

If you're checking for one of a set of values in the same slot (as you
seem to be) then the custom PostingSource only needs to fetch the value
once per document the filter gets checked for, whereas an OR of
OP_VALUE_RANGEs will end up fetching it for each OP_VALUE_RANGE checked
per document the filter gets checked for (the OR will stop checking when
it finds a match for a document at least).

I think it would be possible for the query optimiser to arrange for
all the OP_VALUE_RANGEs on the same slot to share a single ValueList,
but that isn't a currently implemented optimisation.

Cheers,
    Olly



More information about the Xapian-discuss mailing list