[Xapian-tickets] [Xapian] #756: Implement dice coefficient weight metric

Xapian nobody at xapian.org
Wed Apr 4 23:50:55 BST 2018


#756: Implement dice coefficient weight metric
--------------------+---------------------------
 Reporter:  gp1308  |             Owner:  gp1308
     Type:  task    |            Status:  new
 Priority:  normal  |         Milestone:
Component:  Other   |           Version:
 Severity:  normal  |        Resolution:
 Keywords:          |        Blocked By:
 Blocking:          |  Operating System:  All
--------------------+---------------------------
Description changed by gp1308:

Old description:

> In this task, I plan to implement dice coefficient(also known as dice
> similarity coefficient) metric. This metric is generally used to compare
> the similarity between two sets.
>
> Please report if you find any mistake in explanation or have any concern
> related to the implementation.
>
> ----
> Formula:
>
> [[Image(dcs.gif)]]
>
> Q -> Query containing a set of terms
> C -> indexed candidate document
>
> Statistics needed to compute the dcs(dice coefficient similarity for
> short) for each document in database:
> 1. cardinality of the candidate document set.
> 2. cardinality of the query set.
> 3. the intersection of query set and candidate document set.
>
> ----
>
> Example:
>
> d1 = [one two three four five two four]
>
> d2 = [one three four six eight three]
>
> d3 = [nine ten three seven two]
>
> Q1 = [one three]
>
> Using dice coefficient formula, similarity metric for each document w.r.t
> given query Q1 is as below:
>
> d1_score = 0.5, d2_score = 0.5714, d3_score = 0.2857
>
> ----
> Implementation plan:
>
> In Xapian, Score computation is done over each term in the given query
> independently rather than over entire query set for a given document and
> score is accumulated over all the terms in a query.
>
> The cardinality of query set and of a candidate document is constant over
> all the terms in a query. Hence denominator need to be computed only once
> for a selected document. Cardinality of document is nothing but number of
> unique terms(this stat is supported by Xapian::Weight class)
>
> Intersection of query set and document set is number of terms from a
> query set appear in a given document. For each term in a query set, if
> term appears in a document, it contributes unit score(like boolean
> score).

New description:

 In this task, I plan to implement the Dice coefficient (also known as the
 Dice similarity coefficient) as a weighting metric. This metric is
 generally used to measure the similarity between two sets.

 Please report any mistakes in the explanation or any concerns about the
 implementation.

 ----
 Formula:

 [[Image(dcs.gif)]]

 Q -> Query containing a set of terms
 C -> indexed candidate document
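
 For readers without access to the image, the standard definition of the
 Dice coefficient can be written as below. That dcs.gif shows exactly this
 form is an assumption; |Q| and |C| denote the cardinalities of the query
 set and the candidate document set.

```latex
\[
\mathrm{dcs}(Q, C) = \frac{2\,|Q \cap C|}{|Q| + |C|}
\]
```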

 Statistics needed to compute the dcs (short for Dice coefficient
 similarity) for each document in the database:
 1. the cardinality of the candidate document set.
 2. the cardinality of the query set.
 3. the size of the intersection of the query set and the candidate
    document set.

 ----

 Example:

 d1 = [one two three four five two four]

 d2 = [one three four six eight three]

 d3 = [nine ten three seven two]

 Q1 = [one three]

 Using the Dice coefficient formula, with each document treated as a set of
 unique terms, the similarity score of each document with respect to the
 query Q1 is:

 d1_score = 0.5714, d2_score = 0.5714, d3_score = 0.2857
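
 The scores above can be reproduced with a few lines of Python. This is
 only a sketch for checking the arithmetic, not Xapian code; the function
 name dice and the choice to return 0.0 for two empty sets are assumptions
 made for the illustration.

```python
def dice(query_terms, doc_terms):
    """Dice coefficient of two term collections, each treated as a set."""
    q, c = set(query_terms), set(doc_terms)
    if not q and not c:
        return 0.0  # assumed convention for two empty sets
    return 2 * len(q & c) / (len(q) + len(c))


# The example documents and query from the ticket.
docs = {
    "d1": "one two three four five two four".split(),
    "d2": "one three four six eight three".split(),
    "d3": "nine ten three seven two".split(),
}
q1 = ["one", "three"]

scores = {name: round(dice(q1, terms), 4) for name, terms in docs.items()}
print(scores)
```

 Each document has 5 unique terms and the query has 2, so the denominator
 is 7 in every case; d1 and d2 match both query terms (4/7) and d3 matches
 one (2/7).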

 ----
 Implementation plan:

 In Xapian, the score for a given document is computed independently for
 each term in the query, rather than over the entire query set at once, and
 the per-term scores are accumulated over all the terms in the query.

 The cardinalities of the query set and of a candidate document are
 constant over all the terms in a query, so the denominator needs to be
 computed only once per document. The cardinality of a document is simply
 its number of unique terms (this statistic is supported by the
 Xapian::Weight class).

 The size of the intersection of the query set and the document set is the
 number of query terms that appear in the document. Each query term that
 appears in the document contributes a unit score (like a boolean score).
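
 The per-term accumulation described in this plan can be sketched in plain
 Python as below. It only mirrors the idea and does not use the Xapian API;
 the function name and the handling of an empty query and document are
 assumptions for the illustration. Here the unit contribution of each
 matching term is pre-scaled by 2 / (|Q| + |C|), so that the accumulated
 sum equals the Dice coefficient.

```python
def dice_by_term_accumulation(query_terms, doc_terms):
    """Accumulate the Dice score one query term at a time."""
    query_set = set(query_terms)
    doc_set = set(doc_terms)

    # The denominator is the same for every term, so compute it once.
    denominator = len(query_set) + len(doc_set)
    if denominator == 0:
        return 0.0  # assumed convention for an empty query and document

    # Each matching term contributes one unit, scaled by 2 / denominator.
    per_term = 2.0 / denominator
    score = 0.0
    for term in query_set:
        if term in doc_set:
            score += per_term
    return score


d1 = "one two three four five two four".split()
d1_score = dice_by_term_accumulation(["one", "three"], d1)
print(round(d1_score, 4))
```

 For d1 and Q1 both query terms match, so the sum is 2/7 + 2/7 = 4/7, in
 agreement with the example.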

--
Ticket URL: <https://trac.xapian.org/ticket/756#comment:3>
Xapian <https://xapian.org/>