xapian-letor: Prefix strategy discussion while indexing and preparing training file
james-xapian at tartarus.org
Tue Jun 28 10:46:04 BST 2016
On Mon, Jun 27, 2016 at 05:23:36PM +0530, Ayush Tomar wrote:
> Following the discussion with James on prefix strategy being used while
> indexing, at present, while preparing training file in xapian-letor
> (prepare_training_file() function in api/letor_internal.cc), the following
> hard-coded prefixes are added to every query from the query file:
> Xapian::QueryParser parser;
> parser.add_prefix("title", "S");
> parser.add_prefix("subject", "S");
> Hence, each query is parsed as follows: title:<query> ... <query>.
That gets parsed into the terms Sstemmed_word..., which is the common
Xapian approach.
> A user might not have this specific metadata storage in the database or
> could have some other prefixes that were used while indexing.
This is equivalent to having to match prefix configuration between
indexing and searching in non-letor use. (Indeed, that configuration
happens again in bin/questletor.cc.)
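To make the matching requirement concrete, here is a minimal sketch in
plain C++. It is a toy model and deliberately does not use the Xapian
API: the names PrefixMap, make_term, query_term and matches are all
hypothetical, invented for illustration. It only shows that a query
such as title:band finds the indexed term when both sides agree on the
field-to-prefix mapping, and misses it when they do not.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Toy model only: none of this is Xapian API. It mimics the idea that a
// user-facing field name ("title") maps to a term prefix ("S").
using PrefixMap = std::map<std::string, std::string>;

// Build the stored term for a word in a given field.
// Fields missing from the map get no prefix in this sketch.
std::string make_term(const PrefixMap& prefixes,
                      const std::string& field,
                      const std::string& word) {
    auto it = prefixes.find(field);
    return (it == prefixes.end() ? std::string() : it->second) + word;
}

// Turn "field:word" (or a bare "word") into the term to look up.
std::string query_term(const PrefixMap& prefixes, const std::string& query) {
    assert(!query.empty());
    auto colon = query.find(':');
    if (colon == std::string::npos) return query;
    return make_term(prefixes, query.substr(0, colon), query.substr(colon + 1));
}

// True if the query, interpreted with search_prefixes, hits the single
// term produced by indexing `word` in `field` with index_prefixes.
bool matches(const PrefixMap& index_prefixes,
             const PrefixMap& search_prefixes,
             const std::string& field,
             const std::string& word,
             const std::string& query) {
    std::set<std::string> db{make_term(index_prefixes, field, word)};
    return db.count(query_term(search_prefixes, query)) > 0;
}
```

With {"title", "S"} on both sides, matches(..., "title", "band",
"title:band") is true. If the search side instead maps title to some
other prefix, the same query no longer matches, which is the mismatch
described above.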
> Anyway, the user's query file should take care of any prefixes in
> the query string by itself.
I disagree: the query isn't the same as the terms in the database.
The terms, prefixes included, are set at index time and are (as I
understand it) independent of the letor training data, which is input
data rather than Xapian terms.
I don't think we want people to have to convert training data (which
should be human-understandable) into files full of prefixed terms
(Zband ZSband &c).
> Hence, is it a good idea to give hard-coded support for these specific
> prefixes by default?
For the time being, I think it's fine. There are more important things
to worry about (such as which aspects of `Letor` belong on the
`Ranker`, representing the specialisation at work -- SVM, RankList or
whatever -- and which should be in an equivalent of `Enquire`,
representing the process of re-ranking an MSet).
James Aylett, occasional trouble-maker