help improving relevance of snippets displayed by Omega

Mon Sep 21 23:22:03 BST 2020

On Sat, Sep 19, 2020 at 10:56:30PM -0400, Michael Decerbo wrote:
> But I'm still doubtful that expanding the sample size could be the right
> way to obtain excerpts from the document that are relevant to the query.
> Suppose that the sample size were even as big as 10% of the average
> document size, queries contained only a single term, and a typical query
> term appeared on average only once per document.

FWIW, an average of one occurrence per document is too low - for
highlighting purposes only documents the term appears in are
interesting, so the relevant average to consider here is only over
documents with 1 or more occurrences.  You'll never find the term in a
document it doesn't occur in, no matter how much of it you store.
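
To put rough numbers on that, here is a back-of-envelope sketch in
Python.  It rests on simplifying assumptions that are not from this
thread: the stored sample is a fixed leading fraction of the document,
and each occurrence of the term lands at an independent, uniformly
random position.  The function name hit_probability is made up for
illustration.

```python
def hit_probability(sample_fraction, occurrences):
    """Chance that a leading sample contains at least one occurrence.

    Assumes (for illustration only) that each occurrence falls at an
    independent, uniformly random position in the document.
    """
    if not 0 <= sample_fraction <= 1:
        raise ValueError("sample_fraction must be between 0 and 1")
    if occurrences < 1:
        raise ValueError("only documents containing the term matter here")
    # The sample misses the term only if every occurrence falls outside it.
    return 1 - (1 - sample_fraction) ** occurrences


# A 10% sample and a single occurrence: roughly a one in ten chance.
print(round(hit_probability(0.1, 1), 3))
# Storing the whole text always contains the term.
print(hit_probability(1.0, 1))
```

Under those assumptions more occurrences per matching document improve
the odds, and storing all of the text makes the question moot.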

> In that case, it seems to
> me that nine out of ten samples would not contain the single query term, so
> that nine times out of ten the snippet generated from the sample would not
> contain the query term. Is my thinking accurate about this, or am I again
> missing something?

I'm suggesting you set the sample size so that ALL of the text is stored
for MOST documents.  There are usually outliers, so keeping a limit is a
good idea: if someone adds a terabyte file of random ASCII to your
system, it then doesn't result in pointless bloat.
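
One way to pick such a limit is to look at the sizes of the extracted
text in your collection and choose a value that covers most of them.
The following is only a sketch: choose_sample_limit and the coverage
parameter are hypothetical names, and the sizes are invented.  It
assumes you can get the text size of each document by some means of
your own.

```python
import math


def choose_sample_limit(doc_sizes, coverage=0.95):
    """Smallest size that stores at least `coverage` of documents in full."""
    if not doc_sizes:
        raise ValueError("need at least one document size")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(doc_sizes)
    # Nearest-rank percentile: the size of the document at that rank.
    rank = math.ceil(coverage * len(ordered))
    return ordered[rank - 1]


# Invented text sizes in bytes, including one terabyte outlier.
sizes = [1200, 3400, 800, 5600, 2300, 4100, 900, 10**12, 2700, 1500]
print(choose_sample_limit(sizes, coverage=0.9))
```

With these made-up sizes, nine of the ten documents are stored whole
and only the outlier is truncated.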

> In general, I'm wondering how best to use Xapian so that, at query time, my
> application can display an excerpt that is relevant to the query, not a
> sample chosen at indexing time without regard to the query that may or may
> not contain the query term(s). For example, TheyWorkForYou.com is listed on
> xapian.org as a site using Xapian, and when I enter a single-term query on
> that site the document excerpts provided as part of the search results
> invariably include highlighted words, possibly stemmed, responsive to the
> query. That's the effect I would like to achieve.

You need the document text in order to select a dynamic sample, so
either you need to store that text in Xapian, or obtain it at search
time from some external source (which needs to be reasonably efficient
as you need to do this for each document in a page of results - if it
takes 0.1 seconds per document to get the text, you're adding a second
to the time to render a page of 10 results).

With Omega, storing in Xapian is well supported (via setting a large
sample size) so that's what I'm suggesting for the situation you
described.  If you have the text easily accessible somewhere else you
can make use of that, but as I already said you'll need to write some
code - either your own front end (which is what TheyWorkForYou has) or
modifying Omega.
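
To illustrate what "selecting a dynamic sample" means once you have the
text, here is a deliberately simplified sketch in plain Python.  It is
not Xapian's algorithm and not what Omega or TheyWorkForYou do: it
matches whole words case-insensitively with no stemming, and the name
best_snippet and its parameters are made up for this example.  In a
real front end you would more likely pass the text to Xapian's
MSet::snippet() method.

```python
import re


def best_snippet(text, query_terms, width=80, hi_start="<b>", hi_end="</b>"):
    """Return the `width`-character window with the most query-term hits."""
    terms = {t.lower() for t in query_terms}
    # Every word in the text that exactly matches a query term.
    hits = [m for m in re.finditer(r"\w+", text) if m.group().lower() in terms]
    if not hits:
        return text[:width]  # nothing to highlight: fall back to a static sample

    # Try a window starting at each hit; keep the one covering the most hits.
    best_start, best_count = hits[0].start(), 0
    for h in hits:
        start = h.start()
        count = sum(1 for m in hits
                    if start <= m.start() and m.end() <= start + width)
        if count > best_count:
            best_start, best_count = start, count

    # Rebuild the window, wrapping each hit that lies fully inside it.
    end = best_start + width
    out, pos = [], best_start
    for m in hits:
        if m.start() >= best_start and m.end() <= end:
            out.append(text[pos:m.start()])
            out.append(hi_start + m.group() + hi_end)
            pos = m.end()
    out.append(text[pos:end])
    return "".join(out)


print(best_snippet("alpha beta gamma", ["beta"], width=20))
```

The point is only that the window is chosen at query time from the
stored text, so it can always include a query term when the document
contains one.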

Cheers,
    Olly