[Xapian-devel] Document clustering module?

Mon Sep 17 13:33:20 BST 2007

On Mon, Sep 17, 2007 at 06:38:25PM +0800, Yung-chung Lin wrote:
> Then I think the interface can become like this:
> 
>     // Cluster documents by document value 1
>     matches.group_by_value(1);

If you're talking about grouping collapsed documents, that should
probably happen during the match process, like collapse does.  Don't
worry too much about that idea - let's focus on the clustering part
for now, and just bear in mind how it might be reused for this (or
perhaps this problem is too different).

If you're not talking about that, there needs to be a clustering
algorithm specified for this to work.

>     // Iterate through clusters and mset items in each cluster.
>     for (ClusterIterator citer = matches.clusters_begin();
>            citer != matches.clusters_end(); ++citer) {
>         // get_cluster_id() returns the internal cluster index
>         cout << citer->get_cluster_id() << endl;
> 
>         // Cluster's ID (or index) is just an unsigned integer.
>         // Cluster's ID (or index) and mset item's index can be simply stored in
>         // std::vector<std::vector> or std::vector<std::map>
> 
>         for (MSetIterator miter = citer->mset_begin();
>                miter != citer->mset_end(); ++miter) {
>             // Using miter->get_cluster_id() here returns the same.
>             cout << "Doc " << *miter << " is in cluster "
>                     << citer->get_cluster_id() << endl;
>         }
>     }

I wouldn't get too fancy initially - we don't want to produce an
elaborate API which we think does everything conceivable, only to
discover a better approach or something it can't nicely do, and then
have to choose between keeping the sub-optimal API we have, or the pain
of deprecation and transition.

Let's just go with tagging each MSet entry with a cluster id for now.
That seems a good starting point, and everything which has been
suggested so far can either be built on top of that, or provide that as
a side-effect.

And that should allow us to get clustering functionality into a release
sooner.

Cheers,
    Olly