GSoC 2016 - Introduction

Wed May 18 21:06:00 BST 2016

Hello,

I have been thinking about how to write tests that help us arrive at the
public API for clustering. Below I describe two tests and how I am thinking
about the API. I'd like to know whether I'm on the right path, or how this
can be improved.

1) Test case to check Euclidean similarity of document vectors

DEFINE_TESTCASE(euclidian, backend)
{
    Xapian::Database db(get_database("euclidian"));
    // Make this file contain two identical sentences that are treated as
    // two different docs
    // Get MSet containing two docs
    Document doc1 = mset[0].get_document();
    Document doc2 = mset[1].get_document();
    DocSim d;
    // The distance is a real number, so store it in a double, not an int
    double sim = d.get_distance(doc1.termlist_begin(), doc1.termlist_end(),
                                doc2.termlist_begin(), doc2.termlist_end(),
                                SIMILARITY_OPTION /* in this case, euclidian */);
    TEST(sim == 0);
}

The TF-IDF vectors will be created from the documents' termlists inside the
DocSim class. The get_distance() function calculates the distance, and we
can support more similarity measures later on. The default can be Euclidean
distance.
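
To make the idea concrete, here is a minimal standalone sketch of what DocSim
might do internally. It does not use Xapian at all: TermVector, tfidf() and
euclidean_distance() are hypothetical names chosen for illustration, and the
log(N / df) weighting is just one common TF-IDF variant, assumed here rather
than decided.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>

// Hypothetical sparse vector: term -> TF-IDF weight.
using TermVector = std::map<std::string, double>;

// Hypothetical helper: weight each raw term frequency by log(N / df).
// Terms with no document frequency entry are skipped.
TermVector tfidf(const std::map<std::string, unsigned>& term_freq,
                 const std::map<std::string, unsigned>& doc_freq,
                 unsigned total_docs)
{
    assert(total_docs > 0);
    TermVector v;
    for (const auto& tf : term_freq) {
        auto df = doc_freq.find(tf.first);
        if (df == doc_freq.end() || df->second == 0) continue;
        v[tf.first] = tf.second * std::log(double(total_docs) / df->second);
    }
    return v;
}

// Euclidean distance between two sparse vectors. A term that is absent
// from one vector counts as weight 0 in that vector.
double euclidean_distance(const TermVector& a, const TermVector& b)
{
    double sum = 0.0;
    for (const auto& p : a) {
        auto q = b.find(p.first);
        double diff = p.second - (q == b.end() ? 0.0 : q->second);
        sum += diff * diff;
    }
    for (const auto& q : b) {
        if (a.find(q.first) == a.end()) sum += q.second * q.second;
    }
    return std::sqrt(sum);
}
```

With this sketch, two documents with identical termlists give identical
vectors and hence a distance of exactly 0, which is what the test above
checks.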

2) Test case to check whether clusters are valid by checking whether any
cluster is empty

DEFINE_TESTCASE(cluster1, backend)
{
    Xapian::Database db(get_database("cluster_api"));
    // Get MSet against a query, MSet -> matches
    Xapian::Cluster c;
    Xapian::ClusterSet cset = c.cluster(matches, k);
    // cset is an object rather than a pointer, so there is no NULL check
    for (Xapian::ClusterSetIterator i = cset.begin(); i != cset.end(); ++i)
    {
        Xapian::DocumentSet d = i.get_clusterdocs();
        TEST(d.size() != 0);
    }
}

The Xapian::Cluster class will contain the main clustering functionality. It
will cluster the documents and store the results in a Xapian::ClusterSet
object, which is returned by Xapian::Cluster::cluster(). This will also
contain a vector of the cluster IDs and a map from each document ID to its
associated cluster ID.

Xapian::ClusterSet contains the cluster ID and a vector of the documents
belonging to that cluster. Xapian::ClusterSetIterator can be used to go
through the ClusterSet objects.

The documents belonging to a certain cluster can be retrieved by a function
which returns them as a DocumentSet. This could also be made iterable, but
I'm not sure how useful a separate DocumentSet class would be.
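
As a rough illustration of the bookkeeping described above, the map of
document IDs to cluster IDs can be turned into per-cluster document lists and
then checked for empty clusters. This is only a sketch with plain standard
containers: DocId, ClusterId, group_by_cluster() and all_clusters_non_empty()
are hypothetical names, not part of Xapian or of the proposed classes.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-ins for the proposed types; not the Xapian API.
using DocId = unsigned;
using ClusterId = unsigned;

// Group document IDs by the cluster each one was assigned to.
// Clusters in [0, k) that received no document stay as empty vectors.
std::vector<std::vector<DocId>>
group_by_cluster(const std::map<DocId, ClusterId>& assignment, ClusterId k)
{
    std::vector<std::vector<DocId>> clusters(k);
    for (const auto& entry : assignment) {
        // Every cluster ID is assumed to lie in [0, k).
        assert(entry.second < k);
        clusters[entry.second].push_back(entry.first);
    }
    return clusters;
}

// The validity check from test case 2: no cluster may be empty.
bool all_clusters_non_empty(const std::vector<std::vector<DocId>>& clusters)
{
    for (const auto& docs : clusters) {
        if (docs.empty()) return false;
    }
    return true;
}
```

A plain vector of document IDs per cluster already supports iteration, which
may be an argument for not needing a dedicated DocumentSet class at first.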

This is a very rough outline of how I think the API would look. I'd like to
know if there are places where I am going wrong so I can improve on them
before the coding period starts.

Also, I apologize for not being very responsive on the mailing list; I have
had exams going on. They finish on the 26th of this month, after which I can
concentrate on the project completely.

Thanks,
Richhiey