queries for a set of values
Eric Wong
e at 80x24.org
Sat Feb 22 00:19:38 GMT 2025
Olly Betts <olly at survex.com> wrote:
> On Fri, Apr 26, 2024 at 10:37:37PM +0000, Eric Wong wrote:
> > Say I have a bunch of values which I want to filter a query against.
> > If I had boolean terms, it could just OP_OR against the whole set.
> > IOW, this is what notmuch does with terms:
> >
> > std::set<std::string> terms;
> >
> > // notmuch populates terms via terms.insert(*i)...
> >
> > Query(OP_OR, terms.begin(), terms.end());
>
> The slicker way to do this (unless you need the std::set for other
> reasons) would be:
>
> Xapian::Query filter = Xapian::Query::MatchAll;
(resurrecting topic from last year)
If I'm OR-ing, shouldn't that start as Xapian::Query::MatchNothing?
> while (more_terms()) {
>     filter |= Xapian::Query(get_next_term());
> }
>
> Assuming you're using Xapian >= 1.4.10 then |= on an OP_OR Query with
> refcount 1 (as here) is specially optimised and just appends a new
> subquery so you get a single OP_OR node and this is particularly
> efficient (if the refcount is higher it'll build a tree, but still get
> optimised the same way - it's just a bit less efficient because it needs
> to allocate for each node in the tree).
>
> One difference is that filter here will match everything if there are
> no filter terms, so you can just always apply it:
>
> query = Xapian::Query(OP_FILTER, query, filter);
>
> The notmuch way will match nothing for that case so you need to
> conditionalise applying the filter (assuming you still want to match
> something when there are no filter terms).
Ah, ok, so Xapian::Query::MatchNothing makes more sense to me.
> > With a set of integers I have (after sortable_serialise), would the
> > best way be to OP_OR a bunch of OP_VALUE_RANGE queries together?
> >
> > So, perhaps something like:
> >
> > Query(OP_OR,
> > Query(OP_VALUE_RANGE, column, v[0], v[0]),
> > Query(OP_VALUE_RANGE, column, v[1], v[2]),
>
> Did you mean 1 and 1 here?
Yes :x
> > Query(OP_VALUE_RANGE, column, v[3], v[3]),
> > ...
> > Query(OP_VALUE_RANGE, column, v[LAST], v[LAST]))
> >
> > // Or (totally not even compile-tested and I don't know C++)
> > // something like:
> >
> > std::vector<Xapian::Query> subq;
> >
> > for (size_t i = 0; i < nelem; i++) {
> >     std::string v = sortable_serialise(int_vals[i]);
> >
> >     subq.push_back(Query(OP_VALUE_RANGE, column, v, v));
> > }
> >
> > Query(OP_OR, subq.begin(), subq.end());
>
> You can build it up the same way with:
>
> filter |= Query(OP_VALUE_RANGE, column, v, v);
OK, since I don't want to break support for Xapian <=1.4.9 users,
this seems to work:
    for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); i++) {
        Xapian::Document doc = i.get_document();
        std::string val = doc.get_value(column);
        *xqry = Xapian::Query(Xapian::Query::OP_OR, *xqry,
                              Xapian::Query(
                                  Xapian::Query::OP_VALUE_RANGE,
                                  column, val, val));
    }
> > It seems what I'm really looking for is an OP_VALUE_OR or OP_VALUE_IN;
> > but only OP_VALUE_{GE,LE,RANGE} exists.
>
> Just use OP_VALUE_RANGE with equal bounds.
>
> Another approach is to use a custom PostingSource which can fetch the
> value for that slot for each document being considered and check if it's
> one of the values you want.
Noted. I think I'll stick to (what appears to be) working code
for now unless a custom PostingSource can be more efficient.
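For reference, the per-document test that such a PostingSource would perform boils down to a set lookup on the slot value. The sketch below covers only that lookup and deliberately contains no Xapian types; WantedValues is a hypothetical class, and wiring it into an actual Xapian::PostingSource subclass is left out.

```cpp
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical helper (not part of Xapian): holds the wanted slot values
// and reports whether a given document's slot value is one of them.
class WantedValues {
    std::unordered_set<std::string> wanted_;

  public:
    explicit WantedValues(std::unordered_set<std::string> wanted)
        : wanted_(std::move(wanted)) {}

    // `slot_value` is the raw string stored in the value slot,
    // e.g. what Document::get_value() returns for that slot.
    bool matches(const std::string& slot_value) const {
        return wanted_.count(slot_value) != 0;
    }
};
```

Whether this ends up faster than OR-ing OP_VALUE_RANGE subqueries is not something this sketch can answer; it would need measuring against a real index.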
More information about the Xapian-discuss
mailing list