queries for a set of values

Eric Wong e at 80x24.org
Sat Feb 22 00:19:38 GMT 2025


Olly Betts <olly at survex.com> wrote:
> On Fri, Apr 26, 2024 at 10:37:37PM +0000, Eric Wong wrote:
> > Say I have a bunch of values which I want to filter a query against.
> > If I had boolean terms, it could just OP_OR against the whole set.
> > IOW, this is what notmuch does with terms:
> > 
> > 	std::set<std::string> terms;
> > 
> > 	// notmuch populates terms via terms.insert(*i)...
> > 
> > 	Query(OP_OR, terms.begin(), terms.end());
> 
> The slicker way to do this (unless you need the std::set for other
> reasons) would be:
> 
>     Xapian::Query filter = Xapian::Query::MatchAll;

(resurrecting topic from last year)

If I'm OR-ing, shouldn't that start as Xapian::Query::MatchNothing?

>     while (more_terms()) {
>         filter |= Xapian::Query(get_next_term());
>     }
> 
> Assuming you're using Xapian >= 1.4.10 then |= on an OP_OR Query with
> refcount 1 (as here) is specially optimised and just appends a new
> subquery so you get a single OP_OR node and this is particularly
> efficient (if the refcount is higher it'll build a tree, but still get
> optimised the same way - it's just a bit less efficient because it needs
> to allocate for each node in the tree).
> 
> One difference is that filter here will match everything if there are
> no filter terms, so you can just always apply it:
> 
>     query = Xapian::Query(OP_FILTER, query, filter);
> 
> The notmuch way will match nothing for that case so you need to
> conditionalise applying the filter (assuming you still want to match
> something when there are no filter terms).

Ah, ok, so Xapian::Query::MatchNothing makes more sense to me.

> > With a set of integers I have (after sortable_serialise), would the
> > best way be to OP_OR a bunch of OP_VALUE_RANGE queries together?
> > 
> > So, perhaps something like:
> > 
> > 	Query(OP_OR,
> > 		Query(OP_VALUE_RANGE, column, v[0], v[0]),
> > 		Query(OP_VALUE_RANGE, column, v[1], v[2]),
> 
> Did you mean 1 and 1 here?

Yes :x

> > 		Query(OP_VALUE_RANGE, column, v[3], v[3]),
> > 		...
> > 		Query(OP_VALUE_RANGE, column, v[LAST], v[LAST]))
> > 
> > // Or (totally not even compile-tested and I don't know C++)
> > // something like:
> > 
> > 	std::vector<Xapian::Query> subq;
> > 
> > 	for (size_t i = 0; i < nelem; i++) {
> > 		std::string v = sortable_serialise(int_vals[i]);
> > 
> > 		subq.push_back(Query(OP_VALUE_RANGE, column, v, v));
> > 	}
> > 
> > 	Query(OP_OR, subq.begin(), subq.end());
> 
> You can build it up the same way with:
> 
>     filter |= Query(OP_VALUE_RANGE, column, v, v);

OK, since I don't want to break support for users of Xapian <= 1.4.9,
this seems to work:

for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); ++i) {
	Xapian::Document doc = i.get_document();
	std::string val = doc.get_value(column);
	// OR an equal-bounds range (i.e. an exact match on val) into *xqry
	*xqry = Xapian::Query(Xapian::Query::OP_OR, *xqry,
			Xapian::Query(
				Xapian::Query::OP_VALUE_RANGE,
				column, val, val));
}

> > It seems what I'm really looking for is an OP_VALUE_OR or OP_VALUE_IN;
> > but only OP_VALUE_{GE,LE,RANGE} exists.
> 
> Just use OP_VALUE_RANGE with equal bounds.
> 
> Another approach is to use a custom PostingSource which can fetch the
> value for that slot for each document being considered and check if it's
> one of the values you want.

Noted.  I think I'll stick to (what appears to be) working code
for now unless a custom PostingSource can be more efficient.
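
For reference, a minimal sketch of the membership check a custom
PostingSource would make for each candidate document.  This is a plain
C++ model only: it does not subclass Xapian::PostingSource or call any
Xapian API, and the names SlotEntry and ValueSetFilter are hypothetical.
It assumes the wanted values are already in the same serialised form as
the slot contents (e.g. after sortable_serialise).

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for one candidate document: its docid and the
// raw (already serialised) string stored in the value slot of interest.
struct SlotEntry {
    unsigned docid;
    std::string value;
};

// Plain C++ model of the filtering logic.  This is NOT a
// Xapian::PostingSource subclass; it only shows the per-document test.
class ValueSetFilter {
    std::unordered_set<std::string> wanted_;

  public:
    explicit ValueSetFilter(const std::vector<std::string>& wanted)
        : wanted_(wanted.begin(), wanted.end()) {}

    // True if this slot value is one of the wanted values.
    bool matches(const std::string& slot_value) const {
        return wanted_.count(slot_value) != 0;
    }

    // Walk the candidate documents in the order given and keep the
    // docids whose slot value is in the wanted set.
    std::vector<unsigned> filter(const std::vector<SlotEntry>& docs) const {
        std::vector<unsigned> out;
        for (const SlotEntry& d : docs) {
            if (matches(d.value)) out.push_back(d.docid);
        }
        return out;
    }
};
```

The sketch says nothing about whether this would beat OR-ing
equal-bounds OP_VALUE_RANGE queries; it only shows that the check per
document comes down to one hash-set lookup on the slot value.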


