[Xapian-discuss] Lucene ranking formula

James Aylett james-xapian at tartarus.org
Wed Nov 10 11:29:24 GMT 2004


No mention of where it came from, but:

----------------------------------------------------------------------
For the record, Lucene's scoring algorithm is, roughly:

  score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)
 
where:
  score_d   : score for document d
  sum_t     : sum for all terms t
  tf_q      : the square root of the frequency of t in the query
  tf_d      : the square root of the frequency of t in d
  idf_t     : log(numDocs/(docFreq_t+1)) + 1.0
  numDocs   : number of documents in index
  docFreq_t : number of documents containing t
  norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
  norm_d_t  : square root of number of tokens in d in the same field
              as t

(I hope that's right!)

[Doug later added...]

Make that:
  
  score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
                  boost_t) * coord_q_d

where

  boost_t    : the user-specified boost for term t
  coord_q_d  : number of terms in both query and document / number of
               terms in query

The coordination factor gives an AND-like boost to documents that
contain, e.g., all three terms in a three word query over those that
contain just two of the words.
----------------------------------------------------------------------

<http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q31>
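To make the quoted formula easier to compare against other weighting
schemes, here is a rough Python sketch of it. This is only my reading of
the FAQ text, not Lucene's actual code: the function names (idf, score)
and argument layout are made up for illustration, everything is assumed
to live in a single field, and idf_t is taken to mean
log(numDocs/(docFreq_t+1)) + 1.0.

```python
import math
from collections import Counter


def idf(num_docs, doc_freq):
    # idf_t, read as log(numDocs / (docFreq_t + 1)) + 1.0 (assumed grouping)
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def score(query_terms, doc_terms, num_docs, doc_freqs, boosts=None):
    """Score one document against a query, per the quoted formula.

    query_terms, doc_terms: lists of tokens (single field assumed)
    doc_freqs: dict mapping term -> number of documents containing it
    boosts: optional dict mapping term -> user-specified boost (default 1.0)
    """
    boosts = boosts or {}
    q_counts = Counter(query_terms)
    d_counts = Counter(doc_terms)
    if not q_counts or not doc_terms:
        return 0.0

    # norm_q = sqrt(sum_t((tf_q * idf_t)^2)), with tf_q = sqrt(query frequency)
    norm_q = math.sqrt(sum(
        (math.sqrt(count) * idf(num_docs, doc_freqs.get(term, 0))) ** 2
        for term, count in q_counts.items()
    ))
    # norm_d_t = sqrt(number of tokens in d); one field, so it is the same for all t
    norm_d = math.sqrt(len(doc_terms))

    total = 0.0
    matched = 0
    for term, q_count in q_counts.items():
        d_count = d_counts[term]
        if d_count == 0:
            continue  # term absent from the document contributes nothing
        matched += 1
        idf_t = idf(num_docs, doc_freqs.get(term, 0))
        tf_q = math.sqrt(q_count)
        tf_d = math.sqrt(d_count)
        total += (tf_q * idf_t / norm_q) * (tf_d * idf_t / norm_d) * boosts.get(term, 1.0)

    # coord_q_d = terms in both query and document / terms in query
    coord = matched / len(q_counts)
    return total * coord
```

With a three-word query and equal document frequencies, a document
containing all three words scores higher than an equally long document
containing only two, which is the AND-like effect of the coordination
factor described above.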

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


