Newbie: Minimum Indexing Requirements and Strictness

Fri Jan 7 05:11:12 GMT 2022

On Sun, Dec 26, 2021 at 11:47:55PM -0500, Dustin Oprea wrote:
> When I experiment with fully relying on prefixed values, I find that I
> can't omit the unprefixed/unqualified *index_text()* calls or else my
> queries all return empty. What is going on here? Do I always have to index
> some base content even though I might never query it? Is there more
> documentation about this?

If I follow, you have a free-text field (say "title") and want to index
the text from it with a prefix, but you also want an unprefixed search
to match it.

If so, you can index the text without a prefix too, but I'd generally
suggest you instead map "no prefix" to the term prefix for title as
well as to no term prefix with:

    qp.add_prefix("", "");
    qp.add_prefix("", "S");

Then:

    $ python3 -c 'import xapian; qp = xapian.QueryParser(); qp.add_prefix("", ""); qp.add_prefix("", "S"); print(str(qp.parse_query("foo bar")))'
    Query(((foo@1 OR Sfoo@1) OR (bar@2 OR Sbar@2)))

And you can also support `title:foo` with:

    qp.add_prefix("title", "S");

There's a bit of a trade-off here - if you map to multiple term prefixes
as above, the query is more complex, which will tend to be slower.  One
advantage is you can tell which field or fields a term matched in, and
another is it allows searching only the body text (you could also map
say `qp.add_prefix("body", "");` and then title and body would be
searched by default, but `body:foo` would provide a way to search only
the body).

If you index a field both with a prefix and unprefixed, then indexing
will be slower and you'll end up with more posting data for the
unprefixed terms so the database will be larger, but that data is at
least all in one place rather than having to OR two terms at search time
to get it.  Denser posting lists also tend to compress a bit better.

However, if you had two fields you never intended to search separately
then you probably should just index them to a single common prefix (or
both unprefixed).

> Is there a way of increasing strictness so stupid issues like invalid
> prefixes referenced from the query will cause the search to fail or at
> least return empty?

I'm not sure which sort of prefix you're asking about here.

If you mean the user-visible prefix, e.g. the user searching for
`invalid:foo` then that's not an error because it'd break cases like
searching for a URL, a DOS-style path with a drive letter, a Perl module
name, etc.  This also matches how most search engines seem to handle
this - e.g. `site:xapian.org query` limits the search to just xapian.org
on most general purpose web searches, but `invalid:xapian.org query` is
not an error, and seems to be handled similarly to the same query with
the `:` replaced by a different non-word character.

If you mean the term prefix (e.g. add_prefix("title", "XYZZY") where
no terms in the database actually have an XYZZY prefix), then there isn't
really a good way to detect that - the prefix might be valid and just
no documents containing that field have been indexed yet.  The prefixes
used at index time aren't recorded by the database explicitly, only
implicitly on any terms that get them.

If you're trying to debug your indexing vs query parsing (or just
trying to get to grips with how it works), looking at what terms are
indexed for a document in the database (there's a xapian-delve tool
which can be handy for this) and the result of parsing queries (in
Python, str(query); in C++, query.get_description()) should help.

> If my query is just an operatorless list of terms, does the parser
> automatically apply the default operator between them?

Yes, that's what the default operator is for.

Cheers,
    Olly