[Xapian-devel] Document clustering module?
Richard Boulton
richard at lemurconsulting.com
Mon Sep 17 10:32:08 BST 2007
Olly Betts wrote:
>> Enquire enq;
>> ClusterSingleLinkage cluster_algorithm;
>> enq.set_clustering_method(cluster_algorithm);
>> MSet matches = enq.get_mset(1, 10);
>> cout << matches.get_cluster_count() << endl;
>> for (MSetIterator miter = matches.begin(); miter != matches.end(); ++miter) {
>> cout << "Document " << *miter << " is in cluster "
>> << miter->get_cluster_id() << endl;
>> }
>>
>> And let API users do what they want to do with the clusters.
>
> Yes, that seems a very nice approach. It also more naturally allows the
> possibility of using document similarity to eliminate near-duplicates -
> to do that efficiently you want to do it as matches are generated so
> that you can stop when you have enough in the MSet.
>
> It wouldn't allow generating of different clusters of the same results
> (without rerunning the search) but that doesn't seem like it's likely to
> be an annoying limitation.

We've also had the idea of extending the collapse mechanism to group by
a value (instead of just returning the top document in a collapse group,
as it currently does). This kind of interface would allow that to be
represented, too.
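
To make the difference concrete, here is a rough, self-contained sketch of
"collapse to top document" versus "group by value". It does not use the
Xapian API at all: RankedDoc, ValueGroup, collapse_top() and
group_by_value() are hypothetical names invented for illustration, and the
input is assumed to already be in rank order.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for one ranked match: a document id plus the
// value it would be collapsed on. Not part of Xapian's API.
struct RankedDoc {
    unsigned docid;
    std::string collapse_value;
};

// Hypothetical result type for a "group by value" operation.
struct ValueGroup {
    std::string value;             // the shared value for this group
    std::vector<unsigned> docids;  // members, kept in rank order
};

// Collapse as described above: keep only the best-ranked document for
// each distinct value. `ranked` must already be in rank order.
std::vector<RankedDoc> collapse_top(const std::vector<RankedDoc>& ranked) {
    std::set<std::string> seen;
    std::vector<RankedDoc> out;
    for (const RankedDoc& d : ranked) {
        // insert().second is true only the first time a value is seen.
        if (seen.insert(d.collapse_value).second) out.push_back(d);
    }
    return out;
}

// Group by value: keep every document, bucketed by its value. Groups are
// returned in order of their best-ranked member.
std::vector<ValueGroup> group_by_value(const std::vector<RankedDoc>& ranked) {
    std::map<std::string, std::size_t> index;  // value -> position in groups
    std::vector<ValueGroup> groups;
    for (const RankedDoc& d : ranked) {
        auto it = index.find(d.collapse_value);
        if (it == index.end()) {
            it = index.emplace(d.collapse_value, groups.size()).first;
            groups.push_back(ValueGroup{d.collapse_value, {}});
        }
        groups[it->second].docids.push_back(d.docid);
    }
    return groups;
}
```

With this shape, the first member of each group is exactly the document
that a plain collapse would have returned, so the two views stay
consistent with each other.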

There would need to be some way to get a list of the cluster ids
allocated for a given mset, and probably also a way to get further
information on a cluster - some clustering algorithms allow a name to be
assigned to a cluster, so we should be able to provide that (and if we
were performing a "group by value" operation instead of clustering, the
value for each group should be available).
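
One possible shape for those accessors, purely as a sketch: the class
below is a standalone mock, not an extension of the real MSet, and every
name in it (ClusterInfo, ClusteredMatches, get_cluster_ids(),
get_cluster_info(), get_cluster_members()) is an assumption made up for
this example. Only get_cluster_count() echoes the snippet quoted above.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-cluster metadata. Either field may be empty.
struct ClusterInfo {
    std::string name;   // set only if the algorithm names its clusters
    std::string value;  // set only for a "group by value" operation
};

// Hypothetical stand-in for an mset that carries cluster assignments.
class ClusteredMatches {
  public:
    // Record that `docid` was placed in `cluster_id`.
    void add(unsigned docid, unsigned cluster_id) {
        assignment_.push_back(std::make_pair(docid, cluster_id));
        info_[cluster_id];  // make sure the cluster is known, even if unnamed
    }

    // Attach a name and/or group value to a cluster.
    void set_info(unsigned cluster_id, const ClusterInfo& info) {
        info_[cluster_id] = info;
    }

    std::size_t get_cluster_count() const { return info_.size(); }

    // List of cluster ids allocated for this set of matches, ascending.
    std::vector<unsigned> get_cluster_ids() const {
        std::vector<unsigned> ids;
        for (const auto& entry : info_) ids.push_back(entry.first);
        return ids;
    }

    // Further information on one cluster; throws std::out_of_range if the
    // id was never allocated.
    const ClusterInfo& get_cluster_info(unsigned cluster_id) const {
        return info_.at(cluster_id);
    }

    // Documents in one cluster, in the order they were added.
    std::vector<unsigned> get_cluster_members(unsigned cluster_id) const {
        std::vector<unsigned> docs;
        for (const auto& a : assignment_) {
            if (a.second == cluster_id) docs.push_back(a.first);
        }
        return docs;
    }

  private:
    std::vector<std::pair<unsigned, unsigned>> assignment_;  // docid, cluster
    std::map<unsigned, ClusterInfo> info_;
};
```

Keeping the name and the group value in the same small struct means a
caller can treat clusters and value groups uniformly, and just check which
field is non-empty.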
--
Richard