GSoC 2016 - Introduction

Mon May 9 11:42:46 BST 2016

On Fri, May 06, 2016 at 02:29:48AM +0530, Richhiey Thomas wrote:

> I was going through the previous clustering API to understand how it worked
> and it seems like the approach for construction of the termlists which
> are used for distance metrics uses TF-IDF weighting with cosine similarity,
> which is very similar to the approach I would need for this project. Just
> in this case, Euclidean distance would be the metric.
> 
> Would it be good to structure it in a way similar to the previous API with
> a few changes?

I suspect that the public API will want to be fairly similar to the
previous one, yes.

> For example, the Xapian::DocSimCosine::similarity() function in itself
> calculates the tf-idf vectors and calculates the similarity. Instead would
> it be possible to have a custom weighting scheme subclassing
> Xapian::Weight? This can help in providing the user an option about which
> weighting scheme to use to create document vectors in K-means.

I doubt that will work. Xapian::Weight computes a single score for a
document against a given query. Similarity metrics in clustering
generally work by providing a distance between two vectors, each of
which represents a document. So the API you'll want is different to
::Weight.

It probably will be useful in future to allow for different metrics to
be used, though. That will probably involve separating creation of the
tf-idf vectors from calculating the similarity.
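
As a rough illustration of that separation, here is a minimal sketch in
standalone C++. Nothing in it is existing Xapian API: DocVector,
make_tfidf_vector(), DocDistance and CosineDistance are all hypothetical
names, the inputs are plain maps rather than anything read from a
database, and the tf * log(N/df) weighting is just one common tf-idf
variant picked for the example.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>

// Hypothetical sparse document vector: term -> weight.
typedef std::map<std::string, double> DocVector;

// Step 1: build a tf-idf vector for one document.
// term_freqs: term -> frequency within this document.
// doc_freqs:  term -> number of documents containing the term.
// Assumed weighting: tf * log(num_docs / df).
DocVector make_tfidf_vector(const std::map<std::string, unsigned>& term_freqs,
                            const std::map<std::string, unsigned>& doc_freqs,
                            unsigned num_docs)
{
    assert(num_docs > 0);
    DocVector vec;
    for (const auto& tf : term_freqs) {
        auto df = doc_freqs.find(tf.first);
        // Skip terms with no usable document frequency.
        if (df == doc_freqs.end() || df->second == 0) continue;
        double idf = std::log(double(num_docs) / df->second);
        vec[tf.first] = tf.second * idf;
    }
    return vec;
}

// Step 2: a metric only ever sees finished vectors, so it can be swapped
// without touching how the vectors were built.
class DocDistance {
public:
    virtual ~DocDistance() {}
    virtual double distance(const DocVector& a, const DocVector& b) const = 0;
};

// One possible metric: cosine distance, i.e. 1 - cosine similarity.
class CosineDistance : public DocDistance {
public:
    double distance(const DocVector& a, const DocVector& b) const override {
        double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
        for (const auto& e : a) {
            norm_a += e.second * e.second;
            auto it = b.find(e.first);
            if (it != b.end()) dot += e.second * it->second;
        }
        for (const auto& e : b) norm_b += e.second * e.second;
        // Treat an all-zero vector as maximally distant from everything.
        if (norm_a == 0.0 || norm_b == 0.0) return 1.0;
        return 1.0 - dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
    }
};
```

With this shape, a Euclidean metric would be another DocDistance subclass,
and the clustering code would only need a reference to the base class.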

> More ways of creating document sources should be allowed, for example from
> a vector of docids that the user has.

There does seem to be value (as a future extension of clustering) in
allowing people to cluster based on just a set of documents.

> I have also been looking at the existing test API and I'll create a new PR
> for a simple test in the next 1-2 days, maybe for checking whether the
> value of k is valid or checking the Euclidean distance calculations for
> document vectors.

Writing a tested Euclidean distance calculation between two document
vectors sounds reasonably small, but it does require you to decide how
the document vectors are going to be represented. I don't think
that's particularly hard, but it means you should think of it in terms
of the public APIs that will be used to construct the set of doc vecs
out of an MSet, and how they'll be passed into the clustering system
(and how you'll then get the clusters out again).
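
To make the representation question concrete, here is a minimal sketch
of one possibility, again in standalone C++ with no Xapian calls. It
assumes a document vector is a sparse map from term to weight; DocVector,
euclidean_distance() and test_euclidean_distance() are hypothetical
names, and the test uses plain assert() rather than the Xapian test
harness.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>

// Hypothetical sparse document vector: term -> weight.
// A term missing from the map is treated as having weight 0.
typedef std::map<std::string, double> DocVector;

// Euclidean distance between two sparse vectors.
// std::map keeps its keys sorted, so both maps can be walked in step.
double euclidean_distance(const DocVector& a, const DocVector& b)
{
    double sum = 0.0;
    auto ia = a.begin();
    auto ib = b.begin();
    while (ia != a.end() || ib != b.end()) {
        double diff;
        if (ib == b.end() || (ia != a.end() && ia->first < ib->first)) {
            // Term only in a.
            diff = ia->second;
            ++ia;
        } else if (ia == a.end() || ib->first < ia->first) {
            // Term only in b.
            diff = ib->second;
            ++ib;
        } else {
            // Term in both vectors.
            diff = ia->second - ib->second;
            ++ia;
            ++ib;
        }
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

// A small test: disjoint terms with weights 3 and 4 form a 3-4-5 triangle.
void test_euclidean_distance()
{
    DocVector a, b;
    a["apple"] = 3.0;
    b["banana"] = 4.0;
    assert(std::fabs(euclidean_distance(a, b) - 5.0) < 1e-12);
    assert(euclidean_distance(a, a) == 0.0);
    assert(euclidean_distance(a, b) == euclidean_distance(b, a));
}
```

A map keyed on term strings is easy to test but not necessarily what the
public API should expose; the same distance code would work with any
sorted (term, weight) sequence.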

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org