Boosted fields search in Python
James Aylett
james-xapian at tartarus.org
Thu Aug 9 21:35:38 BST 2018
On 9 Aug 2018, at 10:09, Katja Abramova <katja.abramova at dimension.it> wrote:
> I need to do a search for a
> multi-word query in which particular fields are boosted - preferably at
> query time. That is, given a query like "the cat is lying on the mat" (with
> an OR operator, ignoring word positions but with stemming and stop words
> removed), I'd like to search for that query in both, say Title and Body of
> the documents but with Title field boosted to 4 and Body to 2.
Hi, Katja!
There are a few different things going on here, so I'll try to go through them one at a time.
Field searching in Xapian is generally done using prefixes; the practical example in our "getting started" guide discusses this, and has sample code in Python. I'd recommend reading it from the beginning, including the core concepts. (https://getting-started-with-xapian.readthedocs.io/).
It also shows how to use the QueryParser to split and stem user-entered queries into Xapian Query objects. You'll want to set the default_prefix when you call QueryParser::parse_query (this is covered in the concepts section of the getting started guide: https://getting-started-with-xapian.readthedocs.io/en/latest/concepts/indexing/terms.html?highlight=default_prefix#fields-and-term-prefixes).
You'll end up with Python that looks a little like this:
# Some code that sets up the queryparser (stemming, for instance).
# See the getting started guide for a complete example.
# ...
# S = Subject. Note that you can't use a keyword argument for default_prefix, so we have
# to provide the flags as well.
title_query = queryparser.parse_query(querystring, xapian.QueryParser.FLAG_DEFAULT, "S")
Then you need to use OP_SCALE_WEIGHT, as you've identified, to apply the different weightings to the queries parsed against the two fields.
weighted_title_query = xapian.Query(xapian.Query.OP_SCALE_WEIGHT, title_query, 4)
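The body side is built the same way. Here is a minimal sketch of both halves together. It assumes the body text was indexed without a prefix, so that parsing with no default prefix searches it; if your body terms have their own prefix, pass that instead. The helper name build_weighted_queries is made up for illustration, and the boosts are the ones from your question.

```python
def build_weighted_queries(queryparser, querystring, title_boost=4, body_boost=2):
    """Return (weighted_title_query, weighted_body_query) for one query string."""
    # Imported here so that defining the helper does not itself need the bindings.
    import xapian

    # Title terms are assumed to be indexed with the "S" prefix.
    title_query = queryparser.parse_query(
        querystring, xapian.QueryParser.FLAG_DEFAULT, "S")
    # Assumption: body text was indexed with no prefix, so the default
    # (empty) prefix searches it.
    body_query = queryparser.parse_query(querystring)

    # Scale each parsed query by its field boost.
    weighted_title_query = xapian.Query(
        xapian.Query.OP_SCALE_WEIGHT, title_query, title_boost)
    weighted_body_query = xapian.Query(
        xapian.Query.OP_SCALE_WEIGHT, body_query, body_boost)
    return weighted_title_query, weighted_body_query
```

You would call it as build_weighted_queries(queryparser, querystring) once the queryparser has its stemmer and stopper configured.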
Finally you need to combine the two weighted queries. You can do this using OP_OR, which adds the weights together and so ranks a document higher when both the title and the body match. Alternatively, OP_MAX may work better: it scores each document using whichever side gives the higher weight, which will probably be the higher-weighted one. Something like this:
final_query = xapian.Query(xapian.Query.OP_MAX, [weighted_title_query, weighted_body_query])
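To get a feel for the difference between the two operators, here is a toy illustration in plain Python (not Xapian code). It models OP_OR as adding the weights of the matching subqueries and OP_MAX as taking the largest one; the document names and weights are invented for the example.

```python
# Hypothetical per-field weights for three documents, after boosting.
scores = {
    "doc_a": {"title": 4.0, "body": 0.0},  # matches the title only
    "doc_b": {"title": 0.0, "body": 5.0},  # strong body match only
    "doc_c": {"title": 3.0, "body": 2.5},  # moderate match in both fields
}

def rank(combine):
    """Order document names by their combined weight, best first."""
    return sorted(scores, key=lambda d: combine(scores[d].values()), reverse=True)

or_ranking = rank(sum)   # OP_OR-like: weights add up
max_ranking = rank(max)  # OP_MAX-like: best single field wins

print(or_ranking)   # ['doc_c', 'doc_b', 'doc_a']
print(max_ranking)  # ['doc_b', 'doc_a', 'doc_c']
```

With the sum, the document that matches in both fields comes out on top; with the maximum, it drops behind documents that match one field strongly.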
(Note that boosting title to 4 and body to 2 probably isn't better than just boosting title to 2 and leaving body at standard weighting: if these are the only two parts of the query, the ranking depends only on the ratio between the boosts. Of course if you have a more complex search structure going on then that may still make sense!)
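A quick way to convince yourself of this, again as a plain Python toy rather than Xapian code: the per-field weights below are invented, and the combined score is modelled as a boosted sum, as OP_OR would give.

```python
# Hypothetical unboosted (title_weight, body_weight) pairs per document.
docs = {
    "d1": (1.2, 0.3),
    "d2": (0.0, 2.9),
    "d3": (0.5, 1.1),
}

def ranking(title_boost, body_boost):
    """Order documents by title_boost * title + body_boost * body, best first."""
    def combined(name):
        title, body = docs[name]
        return title_boost * title + body_boost * body
    return sorted(docs, key=combined, reverse=True)

# Doubling both boosts doubles every score, so the order is unchanged.
same_order = ranking(4, 2) == ranking(2, 1)
print(ranking(2, 1), same_order)
```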
Hope this helps!
J
--
James Aylett, occasional troublemaker & project governance
xapian.org