Weighting the author of a doc when that term can also appear as a frequent term in other docs
    Alex Aminoff 
    aminoff at nber.org
       
    Thu Sep 28 18:27:18 BST 2017
    
    
  
We have a corpus of academic papers. Sometimes there is an academic 
controversy, and one paper is a response or rebuttal to another paper. 
The name of the author of the first paper may then appear many times in 
the text of the second paper. In light of this, how should we set the 
weight on the author field?
Here is an example:
http://www.nber.org/papers/w11215
  in which the term "Hoxby" appears 315 times, referring to several 
previous papers by Hoxby
http://www.nber.org/papers/w11216
  in which the term "Rothstein" is used 47 times
So if a user searches for "Hoxby", I would prefer that the comment on 
Hoxby not utterly dominate the results for papers of which Hoxby is the 
author. But I don't want to set the weight on the author field to 
something like 300, since that would cause a search for "Moore's Law" 
to be dominated by papers written by authors named Moore.
One suggestion someone made was to treat the 300th mention of Hoxby as 
less important than the first. I tried to read
  https://xapian.org/docs/bm25.html
and my tentative conclusion is that, as long as f (the number of times 
the term occurs in the document) is small relative to L or K, the value 
of the expression increases roughly linearly with f. To make it grow 
less than linearly, we might invoke the constant mentioned here:
> BM25 originally introduced another constant, as a power to which f and 
> K are raised. However, Stephen remarks that powers other than 1 were 
> 'not helpful', and other tests confirm this, so Xapian's 
> implementation of BM25 ignores this.
If I could raise f to a power less than 1, that would do what I want. 
But I am not at all sure this is the right approach.
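To make the idea concrete, here is a minimal sketch in plain Python (it 
does not use the Xapian API). It assumes the f-dependent part of the 
weight has the form (k1 + 1) * f / (K + f) with K = k1 * (b*L + (1 - b)), 
which is my reading of the BM25 page, and it assumes k1 = 1 and b = 0.5. 
The function name wdf_component, the argument normlen (standing for L) 
and the argument power are all made up for illustration; per the passage 
quoted above, Xapian's BM25 does not offer such an exponent, so this only 
shows what it would do to the numbers.

```python
def wdf_component(f, normlen, k1=1.0, b=0.5, power=1.0):
    """f-dependent part of a BM25-style term weight (illustrative only).

    f       -- number of times the term occurs in the document
    normlen -- document length divided by the average length (L)
    power   -- hypothetical exponent applied to both f and K;
               1.0 gives the assumed plain BM25 form
    """
    # Assumed form of K from the BM25 description.
    K = k1 * (b * normlen + (1.0 - b))
    f_pow = f ** power
    return (k1 + 1.0) * f_pow / (K ** power + f_pow)


if __name__ == "__main__":
    # Compare a few term counts, including the 47 and 315 from the
    # example papers, for a document of average length (normlen = 1).
    for f in (1, 5, 47, 315):
        plain = wdf_component(f, normlen=1.0)
        damped = wdf_component(f, normlen=1.0, power=0.5)
        print(f"f={f:4d}  power=1.0: {plain:.3f}  power=0.5: {damped:.3f}")
```

Under these assumptions the component can never exceed k1 + 1, so it is 
only close to linear while f is much smaller than K. A power below 1 
makes it approach that ceiling more slowly. Whether either effect is 
noticeable for real documents depends on their actual lengths.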
Perhaps in real use this will turn out to be a minor issue.
  - Alex
    
    