Searching in multiple databases with different stemming

Olly Betts olly at survex.com
Mon Aug 15 00:27:23 BST 2022


On Sat, Aug 13, 2022 at 05:13:56PM +0200, Emmanuel Engelhart wrote:
> Unfortunately, the search request can be stemmed either in English or in
> French. This means that if an unstemmed pattern exists in both
> languages (French and English share a lot of words), the stemmed version
> will be for one language only and therefore we will get results from only
> one database.
>
> For the moment, it is not really clear how we should deal with this
> problem/limitation. Any ideas? Would it be possible to properly merge two
> MSets (resulting from two search requests)?

If you're using a stemming strategy which indexes unstemmed terms as
well (which you probably are - it's the default, and if you don't then
exact phrase searches aren't possible) then one option is to disable
stemming when searching such combinations of databases.

You lose the benefits of stemming, but also avoid issues where e.g. the
English stemmer creates an undesirable false match against an unrelated
word stemmed by the French stemmer to the same combination of
characters.

Or you can include the stemmer language in the prefix added to stemmed
terms, and then parse the query with each stemmer and combine the two
parsed queries with OP_OR.  The query optimiser will see that none of the
stemmed English terms are present in the French database (and vice
versa) and cull the useless part of the query early on, so effectively
you end up just running one version of the query on each database.

Or you can add a `Lfr` term to every document in the French database
and `Len` to every document in the English one, and search for:

    Query(OP_FILTER, query_parsed_with_en_stemmer, Query("Len")) |
    Query(OP_FILTER, query_parsed_with_fr_stemmer, Query("Lfr"))

Again the query optimiser should simplify that to just run each version
of the parsed query by itself on the appropriate database.
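
To make the boolean semantics of that query concrete, here is a toy
model in plain Python.  It is only a sketch and does not use the Xapian
API: the documents, the `Z`-prefixed stems and the helper names
`filtered` and `search` are all invented for illustration, and Xapian
itself works on posting lists and weights rather than Python sets.

```python
# Toy model only: each "database" maps a docid to the set of terms indexed
# for that document.  The stems are invented, not real Snowball output.
english_db = {
    1: {"Len", "Zparti"},   # would match the query stemmed as English
    2: {"Len", "Zpart"},    # unrelated English word sharing the French stem
}
french_db = {
    101: {"Lfr", "Zpart"},  # would match the query stemmed as French
    102: {"Lfr", "Zparti"}, # unrelated French word sharing the English stem
}


def filtered(db, query_term, lang_term):
    """OP_FILTER-like: docs containing query_term, kept only if they
    also contain lang_term."""
    return {
        docid
        for docid, terms in db.items()
        if query_term in terms and lang_term in terms
    }


def search(db, en_term, fr_term):
    """(en_term FILTER Len) OR (fr_term FILTER Lfr)"""
    return filtered(db, en_term, "Len") | filtered(db, fr_term, "Lfr")


# Assume the English stemmer gave "Zparti" and the French one "Zpart".
english_hits = search(english_db, "Zparti", "Zpart")
french_hits = search(french_db, "Zparti", "Zpart")
```

In this model the `Lfr` branch can never match anything in the English
database (and the `Len` branch nothing in the French one), so documents
2 and 102, which only share a stem with the other language's version of
the query, are not returned.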

The development version contains code to merge MSet objects which is
used for combining results from remotes, but 1.4 uses a different
approach for that and doesn't have such code.  It's not a public API,
but perhaps could be.  The tricky part of using it properly is that the
weights need to be scaled to be compatible for it to give correct
results (internally that's done for you for remote searches).
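
As a rough illustration of what such a merge involves, here is a sketch
in plain Python.  This is not the code from the development version and
not a Xapian API: the function name `merge_results`, the
`(weight, docid)` tuples and the per-list scale factors are assumptions
made for the example, and working out suitable scale factors (the
tricky part) is not shown.

```python
import heapq
from itertools import islice


def merge_results(result_lists, scale_factors, limit):
    """Merge ranked lists of (weight, docid) into a single ranking.

    Each input list must already be sorted by descending weight.
    scale_factors[i] is an assumed positive multiplier which makes the
    weights of list i comparable with those of the other lists.
    """
    scaled = [
        [(weight * factor, docid) for weight, docid in results]
        for results, factor in zip(result_lists, scale_factors)
    ]
    # A positive factor preserves each list's order, so a k-way merge of
    # the already-sorted lists gives the combined ranking.
    merged = heapq.merge(*scaled, key=lambda item: item[0], reverse=True)
    return list(islice(merged, limit))


# Hypothetical results from two separate searches.
en_hits = [(2.0, "en:1"), (1.5, "en:2")]
fr_hits = [(10.0, "fr:7"), (4.0, "fr:3")]
top3 = merge_results([en_hits, fr_hits], [1.0, 0.25], 3)
```

Without the scaling step the French results here would always rank
above the English ones simply because their raw weights are on a
different scale.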

Cheers,
    Olly


