[Xapian-tickets] [Xapian] #756: Implement dice coefficient weight metric

Xapian nobody at xapian.org
Wed Apr 4 22:02:58 BST 2018


#756: Implement dice coefficient weight metric
---------------------------+--------------------
        Reporter:  gp1308  |      Owner:  gp1308
            Type:  task    |     Status:  new
        Priority:  normal  |  Milestone:
       Component:  Other   |    Version:
        Severity:  normal  |   Keywords:
      Blocked By:          |   Blocking:
Operating System:  All     |
---------------------------+--------------------
 In this task, I plan to implement the dice coefficient (also known as the
 dice similarity coefficient) metric. This metric is generally used to
 measure the similarity between two sets.

 Please report any mistakes you find in the explanation, or any concerns
 about the implementation.

 ----
 Formula:

 [[Image(dcs.gif)]]

 Q -> Query containing a set of terms
 C -> indexed candidate document
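
 For readers without the image: assuming dcs.gif shows the standard dice
 coefficient over the query term set Q and the document term set C (an
 assumption, though it agrees with the d2 and d3 scores in the example
 further down), the formula can be written as:

```latex
\[
\mathrm{dcs}(Q, C) = \frac{2\,|Q \cap C|}{|Q| + |C|}
\]
```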

 Statistics needed to compute the dcs (dice coefficient similarity, for
 short) for each document in the database:
 1. Cardinality of the candidate document set.
 2. Cardinality of the query set.
 3. Cardinality of the intersection of the query and candidate document sets.

 ----

 Example:

 d1 = [one two three four five two four]

 d2 = [one three four six eight three]

 d3 = [nine ten three seven two]

 Q1 = [one three]

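 Using the dice coefficient formula, the similarity score of each document
 w.r.t. the given query Q1 is as below (each document has 5 unique terms;
 d1 and d2 contain both query terms, d3 contains only one):

 d1_score = 0.5714, d2_score = 0.5714, d3_score = 0.2857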
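
 The example can be checked with a small standalone sketch. It assumes the
 standard dice coefficient, 2|Q n C| / (|Q| + |C|), and treats each document
 as a set of unique terms. The function name dice_coefficient is
 hypothetical and is not part of Xapian. Under this reading d1 has five
 unique terms (one, two, three, four, five), so the sketch gives
 4/7 = 0.5714 for it.

```python
def dice_coefficient(query_terms, doc_terms):
    """Dice coefficient between two term collections, treated as sets."""
    q = set(query_terms)
    c = set(doc_terms)
    if not q and not c:
        # Convention chosen for this sketch: two empty sets score 0.
        return 0.0
    return 2 * len(q & c) / (len(q) + len(c))


d1 = "one two three four five two four".split()
d2 = "one three four six eight three".split()
d3 = "nine ten three seven two".split()
q1 = "one three".split()

# Score every document against the query Q1.
scores = {
    name: dice_coefficient(q1, doc)
    for name, doc in [("d1", d1), ("d2", d2), ("d3", d3)]
}

for name, score in scores.items():
    print(f"{name}_score = {score:.4f}")
```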

 ----
 Implementation plan:

 In Xapian, the score is computed independently for each term in the given
 query, rather than over the entire query set for a given document, and the
 per-term scores are accumulated over all the terms in the query.

 The cardinalities of the query set and of a candidate document are constant
 across all the terms in a query. Hence the denominator needs to be computed
 only once for a selected document. The cardinality of a document is simply
 its number of unique terms (this stat is supported by the Xapian::Weight
 class).

 The size of the intersection of the query set and the document set is the
 number of query terms that appear in a given document. For each term in the
 query set, if the term appears in the document, it contributes a unit score
 (like a boolean score).
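
 The per-term accumulation described above can be sketched as follows. This
 is a standalone illustration, not Xapian code: per_term_contribution and
 accumulated_score are hypothetical names, and scaling each unit hit by
 2 / (|Q| + |C|) is an assumption about how the unit scores would be
 combined with the shared denominator.

```python
def per_term_contribution(term, doc_unique_terms, denominator):
    """Score added by one query term: a unit hit scaled by 2 / (|Q| + |C|)."""
    if term not in doc_unique_terms:
        return 0.0
    return 2.0 / denominator


def accumulated_score(query_terms, doc_terms):
    """Sum the per-term contributions over all terms of the query."""
    query = set(query_terms)
    doc_unique = set(doc_terms)
    # |Q| + |C| is identical for every query term, so compute it once
    # per document rather than once per term.
    denominator = len(query) + len(doc_unique)
    if denominator == 0:
        return 0.0
    total = 0.0
    for term in query:
        total += per_term_contribution(term, doc_unique, denominator)
    return total


d2 = "one three four six eight three".split()
d3 = "nine ten three seven two".split()
q1 = ["one", "three"]

print(f"{accumulated_score(q1, d2):.4f}")
print(f"{accumulated_score(q1, d3):.4f}")
```

 Summing the per-term contributions gives the same value as evaluating the
 set formula in one step, because every matching term adds the same amount.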

--
Ticket URL: <https://trac.xapian.org/ticket/756>
Xapian <https://xapian.org/>