[Xapian-tickets] [Xapian] #756: Implement dice coefficient weight metric
Xapian
nobody at xapian.org
Wed Apr 4 23:50:55 BST 2018
#756: Implement dice coefficient weight metric
--------------------+---------------------------
Reporter: gp1308 | Owner: gp1308
Type: task | Status: new
Priority: normal | Milestone:
Component: Other | Version:
Severity: normal | Resolution:
Keywords: | Blocked By:
Blocking: | Operating System: All
--------------------+---------------------------
Description changed by gp1308:
Old description:
> In this task, I plan to implement dice coefficient(also known as dice
> similarity coefficient) metric. This metric is generally used to compare
> the similarity between two sets.
>
> Please report if you find any mistake in explanation or have any concern
> related to the implementation.
>
> ----
> Formula:
>
> [[Image(dcs.gif)]]
>
> Q -> Query containing a set of terms
> C -> indexed candidate document
>
> Statistics needed to compute the dcs(dice coefficient similarity for
> short) for each document in database:
> 1. cardinality of the candidate document set.
> 2. cardinality of the query set.
> 3. the intersection of query set and candidate document set.
>
> ----
>
> Example:
>
> d1 = [one two three four five two four]
>
> d2 = [one three four six eight three]
>
> d3 = [nine ten three seven two]
>
> Q1 = [one three]
>
> Using dice coefficient formula, similarity metric for each document w.r.t
> given query Q1 is as below:
>
> d1_score = 0.5, d2_score = 0.5714, d3_score = 0.2857
>
> ----
> Implementation plan:
>
> In Xapian, Score computation is done over each term in the given query
> independently rather than over entire query set for a given document and
> score is accumulated over all the terms in a query.
>
> The cardinality of query set and of a candidate document is constant over
> all the terms in a query. Hence denominator need to be computed only once
> for a selected document. Cardinality of document is nothing but number of
> unique terms(this stat is supported by Xapian::Weight class)
>
> Intersection of query set and document set is number of terms from a
> query set appear in a given document. For each term in a query set, if
> term appears in a document, it contributes unit score(like boolean
> score).
New description:
In this task, I plan to implement the Dice coefficient (also known as the
Dice similarity coefficient) as a weighting metric. This metric is generally
used to measure the similarity between two sets.
Please report any mistakes in the explanation or any concerns about the
implementation.
----
Formula:
[[Image(dcs.gif)]]
Q -> Query containing a set of terms
C -> indexed candidate document
Statistics needed to compute the DCS (short for Dice coefficient
similarity) for each document in the database:
1. the cardinality of the candidate document set.
2. the cardinality of the query set.
3. the cardinality of the intersection of the query set and the candidate
document set.
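The formula image is not reproduced in this notification. Assuming it shows the standard Dice coefficient, which uses exactly the three statistics listed above and is consistent with the scores in the example, it can be written as follows (the name dcs follows the abbreviation used in this ticket):

```latex
\mathrm{dcs}(Q, C) = \frac{2\,|Q \cap C|}{|Q| + |C|}
```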
----
Example:
d1 = [one two three four five two four]
d2 = [one three four six eight three]
d3 = [nine ten three seven two]
Q1 = [one three]
Each document is treated as a set, so repeated terms are counted once.
Using the Dice coefficient formula, the similarity score of each document
with respect to query Q1 is:
d1_score = 0.5714, d2_score = 0.5714, d3_score = 0.2857
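The scores above can be checked with a few lines of Python. This is only a sketch of the set-based formula, not Xapian code; the function name dice_coefficient and the variable names are made up for illustration.

```python
def dice_coefficient(query_terms, doc_terms):
    """Dice coefficient between two term lists, treated as sets."""
    query_set = set(query_terms)   # duplicates in the query count once
    doc_set = set(doc_terms)       # duplicates in the document count once
    denominator = len(query_set) + len(doc_set)
    if denominator == 0:
        return 0.0                 # both sets empty: define the score as 0
    return 2 * len(query_set & doc_set) / denominator


# Documents and query taken from the example in this ticket.
docs = {
    "d1": "one two three four five two four".split(),
    "d2": "one three four six eight three".split(),
    "d3": "nine ten three seven two".split(),
}
q1 = "one three".split()

scores = {name: dice_coefficient(q1, terms) for name, terms in docs.items()}
for name, score in scores.items():
    print(f"{name}_score = {score:.4f}")
```

d1 has seven tokens but only five unique terms, so its score is 2*2/(2+5) = 4/7, the same as d2.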
----
Implementation plan:
In Xapian, the score is computed for each term in the query independently,
rather than over the whole query set for a given document, and the per-term
scores are summed over all the terms in the query.
The cardinalities of the query set and of a candidate document are the same
for every term in the query, so the denominator needs to be computed only
once per document. The cardinality of a document is simply its number of
unique terms (this statistic is supported by the Xapian::Weight class).
The intersection of the query set and the document set is the number of
query terms that appear in the document. Each query term that appears in
the document contributes a unit score (like a boolean score).
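The per-term accumulation described above can be sketched in Python as follows. This is not the Xapian::Weight API: the function names are hypothetical, and folding the factor 2/(|Q|+|C|) into each term's contribution is an assumption about how the unit scores would be scaled so that their sum equals the Dice coefficient.

```python
def term_contribution(term_in_doc, query_size, doc_unique_terms):
    """Score contributed by one query term for one document (hypothetical)."""
    if not term_in_doc:
        return 0.0                      # absent terms contribute nothing
    # Unit (boolean-like) score scaled by the per-document constant factor.
    return 2.0 / (query_size + doc_unique_terms)


def accumulated_score(query_terms, doc_terms):
    """Sum the per-term contributions, as a matcher would accumulate them."""
    query_set = set(query_terms)
    doc_set = set(doc_terms)
    return sum(
        term_contribution(term in doc_set, len(query_set), len(doc_set))
        for term in query_set
    )


d1 = "one two three four five two four".split()
d3 = "nine ten three seven two".split()
q1 = "one three".split()

print(f"d1: {accumulated_score(q1, d1):.4f}")
print(f"d3: {accumulated_score(q1, d3):.4f}")
```

For d1, both query terms match and each contributes 2/7, giving 4/7; for d3 only "three" matches, giving 2/7. These agree with the set-based scores in the example.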
--
Ticket URL: <https://trac.xapian.org/ticket/756#comment:3>
Xapian <https://xapian.org/>