GSoC 2016 - Introduction

Richhiey Thomas richhiey.thomas at gmail.com
Thu May 5 21:59:48 BST 2016


Hello,

Thanks James for the reply. That cleared a few things up. Apologies for
the late reply; I have exams going on.

I was going through the previous clustering API to understand how it
worked. It seems that the termlists used for the distance metrics are
built with TF-IDF weighting and compared using cosine similarity, which is
very similar to the approach I would need for this project. The only
difference is that Euclidean distance would be the metric here.
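
To make the difference between the two metrics concrete, here is a rough
standalone sketch on sparse term-weight vectors. It is not code from the
previous clustering API: the TermVector alias and both function names are
hypothetical, and the weights are assumed to be TF-IDF values computed
elsewhere.

```cpp
#include <cmath>
#include <map>
#include <string>

// Hypothetical sparse document vector: term -> TF-IDF weight.
using TermVector = std::map<std::string, double>;

// Cosine similarity: dot product divided by the product of the norms.
// Returns 0 if either vector has zero length.
double cosine_similarity(const TermVector& a, const TermVector& b) {
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (const auto& entry : a) {
        norm_a += entry.second * entry.second;
        auto it = b.find(entry.first);
        if (it != b.end()) dot += entry.second * it->second;
    }
    for (const auto& entry : b) norm_b += entry.second * entry.second;
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}

// Euclidean distance: a term missing from one vector counts as weight 0.
double euclidean_distance(const TermVector& a, const TermVector& b) {
    double sum = 0.0;
    for (const auto& entry : a) {
        auto it = b.find(entry.first);
        double diff = entry.second - (it != b.end() ? it->second : 0.0);
        sum += diff * diff;
    }
    for (const auto& entry : b) {
        // Terms only present in b were not visited by the loop above.
        if (a.find(entry.first) == a.end()) sum += entry.second * entry.second;
    }
    return std::sqrt(sum);
}
```

Both functions consume the same vectors, so the termlist construction could
stay as it is and only the final comparison step would change.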

Would it be good to structure it in a way similar to the previous API with
a few changes?

For example, the Xapian::DocSimCosine::similarity() function itself
calculates the TF-IDF vectors and then the similarity. Instead, would
it be possible to have a custom weighting scheme subclassing
Xapian::Weight? This would give the user a choice of which weighting
scheme to use when creating document vectors in K-means.
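
A rough sketch of the separation being suggested, with the weighting
pulled out of the similarity calculation. Everything here is hypothetical:
TermWeighter is a made-up interface used only for illustration, it is not
Xapian::Weight and does not mirror its virtual methods, and
build_document_vector is an assumed helper name.

```cpp
#include <cmath>
#include <map>
#include <string>

// Hypothetical interface, NOT Xapian::Weight: maps raw statistics to a weight.
struct TermWeighter {
    virtual ~TermWeighter() = default;
    // tf: term frequency in the document, df: number of documents containing
    // the term, n_docs: total number of documents in the collection.
    virtual double weight(double tf, double df, double n_docs) const = 0;
};

// One possible scheme: tf * log(N / df).
struct TfIdfWeighter : TermWeighter {
    double weight(double tf, double df, double n_docs) const override {
        return df > 0.0 ? tf * std::log(n_docs / df) : 0.0;
    }
};

// Another scheme the user could pick instead: raw term frequency.
struct RawTfWeighter : TermWeighter {
    double weight(double tf, double, double) const override { return tf; }
};

// Builds a document vector with whichever scheme the caller passes in.
std::map<std::string, double> build_document_vector(
    const std::map<std::string, double>& term_freqs,
    const std::map<std::string, double>& doc_freqs,
    double n_docs,
    const TermWeighter& weighter) {
    std::map<std::string, double> vec;
    for (const auto& entry : term_freqs) {
        auto it = doc_freqs.find(entry.first);
        double df = (it != doc_freqs.end()) ? it->second : 0.0;
        vec[entry.first] = weighter.weight(entry.second, df, n_docs);
    }
    return vec;
}
```

With this shape the clustering code never needs to know which scheme
produced the vectors it is given.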

More ways of creating document sources should also be allowed, for example
from a vector of docids that the user already has.
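
As an illustration of what such a source might look like, here is a
minimal sketch. The DocIdSource interface and the VectorDocIdSource class
are hypothetical names, not part of the existing API, and DocId is a plain
unsigned stand-in for a Xapian docid so that the example compiles on its
own.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for a Xapian docid so the sketch has no external dependencies.
using DocId = unsigned;

// Hypothetical interface: anything that can hand out docids one at a time.
struct DocIdSource {
    virtual ~DocIdSource() = default;
    virtual bool at_end() const = 0;
    virtual DocId next() = 0;  // precondition: !at_end()
};

// A source backed by a vector of docids supplied by the user.
class VectorDocIdSource : public DocIdSource {
    std::vector<DocId> ids_;
    std::size_t pos_ = 0;

  public:
    explicit VectorDocIdSource(std::vector<DocId> ids) : ids_(std::move(ids)) {}
    bool at_end() const override { return pos_ >= ids_.size(); }
    DocId next() override { return ids_[pos_++]; }
};
```

Other sources could implement the same two methods, so the clustering code
would only depend on the interface.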

I have also been looking at the existing test API, and I'll create a new
PR with a simple test in the next 1-2 days, perhaps checking whether the
value of k is valid or checking the Euclidean distance calculations for
document vectors.
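
For the k check, something along these lines is what I have in mind. This
is only a sketch: is_valid_k is a hypothetical helper, and it assumes that
a valid k is at least 1 and no larger than the number of documents being
clustered.

```cpp
#include <cstddef>

// Hypothetical helper. Assumption: k is valid when 1 <= k <= n_docs.
bool is_valid_k(std::size_t k, std::size_t n_docs) {
    return k >= 1 && k <= n_docs;
}
```

A test would then cover k = 0, k equal to the number of documents, and k
larger than the number of documents.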

Thanks.