[Xapian-devel] [GSOC 2014] Indexing INEX dataset

James Aylett james-xapian at tartarus.org
Sat Mar 22 20:14:17 GMT 2014


We could have a convention of recording custom wdf increments, by prefix, in the database metadata. In theory I believe it's possible to recover the increment by comparing a term's wdf with the length of its position list in the same document (any document will do, provided it's indexed consistently). But that ignores terms indexed without positions (which shouldn't be a problem here) and databases where you've killed all position data, and it probably wouldn't be appropriate at search time anyway.
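
To make the two ideas concrete, here is a rough sketch in plain Python rather than against the real bindings. The metadata key format ("wdf_inc:" plus the prefix) and both function names are made up for illustration; nothing in Xapian defines them. In practice the inputs would come from the database (the term's wdf from the document's term list, the positions from Database.positionlist(), and the recorded value from Database.get_metadata()), and the estimate only holds if every occurrence of the term was indexed with positions and the same increment.

```python
from fractions import Fraction


def estimate_wdf_inc(wdf, positions):
    """Estimate the per-occurrence wdf increment for one term in one document.

    wdf       -- the term's wdf in that document
    positions -- the term's position list in that document
    Returns None if there is no position data or the ratio is not a whole number.
    """
    count = len(positions)
    if count == 0:
        # Non-positional term, or position data has been dropped.
        return None
    ratio = Fraction(wdf, count)
    return int(ratio) if ratio.denominator == 1 else None


def wdf_inc_for_prefix(metadata, prefix, samples):
    """Prefer a value recorded in metadata; otherwise fall back to estimating.

    metadata -- dict standing in for database metadata (an assumption)
    samples  -- iterable of (wdf, positions) pairs for terms with this prefix
    """
    key = "wdf_inc:" + prefix  # hypothetical key naming convention
    recorded = metadata.get(key)
    if recorded:
        return int(recorded)
    estimates = {estimate_wdf_inc(w, p) for w, p in samples}
    estimates.discard(None)
    # Only trust the estimate if every usable sample agrees.
    return estimates.pop() if len(estimates) == 1 else None
```

For example, a title term with wdf 10 and two recorded positions suggests an increment of 5, while samples that disagree with each other give no answer at all, which is the "indexed consistently" caveat in practice.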

J

> On 22 Mar 2014, at 18:27, Parth Gupta <pargup8 at gmail.com> wrote:
> 
> Yes James, is there any automatic way to know what weight was used for titles or more generally for terms with some prefix?
> 
> 
> 
>> On Sat, Mar 22, 2014 at 1:35 PM, James Aylett <james-xapian at tartarus.org> wrote:
>> On 22 Mar 2014, at 08:22, Parth Gupta <pargup8 at gmail.com> wrote:
>> 
>> > For unsupervised approaches like BM25 this approach works well, but letor does not need special weighting for titles in this form, as it assigns weights to title features separately itself.
>> >
>> > But I see your concern: it would be a problem when BM25 is used on an index with this setup. Hence it's preferable to make a note of this uplift in title weight for xapian-letor and normalize for it wherever the statistics are calculated.
>> 
>> This would need configuring, though, wouldn't it? Not everyone (and I'm thinking of people who don't index using omindex here) applies a wdf of 5 while indexing titles; they may apply a different non-1 number, or just leave it at 1 (and possibly apply weighting at search time).
>> 
>> J
>> 
>> --
>>  James Aylett, occasional trouble-maker
>>  xapian.org
> 