[Xapian-devel] Document clustering module?
☼ 林永忠 ☼ (Yung-chung Lin)
henearkrxern at gmail.com
Mon Sep 17 11:38:25 BST 2007
Then I think the interface could look like this:
    // Cluster documents by document value 1
    matches.group_by_value(1);

    // Iterate through clusters and mset items in each cluster.
    for (ClusterIterator citer = matches.clusters_begin();
         citer != matches.clusters_end(); ++citer) {
        // get_cluster_id() returns the internal cluster index
        cout << citer->get_cluster_id() << endl;

        // Cluster's ID (or index) is just an unsigned integer.
        // Cluster's ID (or index) and mset item's index can be simply stored in
        // std::vector<std::vector> or std::vector<std::map>
        for (MSetIterator miter = citer->mset_begin();
             miter != citer->mset_end(); ++miter) {
            // Using miter->get_cluster_id() here returns the same.
            cout << "Doc " << *miter << " is in cluster "
                 << citer->get_cluster_id() << endl;
        }
    }
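
To make the storage idea in the comments above concrete, here is a minimal, self-contained sketch of how cluster ids and mset item indexes might be kept in nested vectors. Nothing in it is existing Xapian API: Item, Clusters and the free function group_by_value() are hypothetical names used only for illustration, and grouping by exact value equality is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one MSet entry: a document id plus the
// contents of the value slot being grouped on. Not a Xapian type.
struct Item {
    unsigned docid;
    std::string value;
};

// Hypothetical result layout:
//   members[c]    = indexes (into the item vector) of cluster c's members
//   cluster_of[i] = cluster id of item i
struct Clusters {
    std::vector<std::vector<std::size_t> > members;
    std::vector<unsigned> cluster_of;
};

// Group items by value. Cluster ids are plain unsigned integers handed
// out in order of first appearance of each distinct value.
Clusters group_by_value(const std::vector<Item>& items) {
    Clusters result;
    std::map<std::string, unsigned> id_for_value;
    for (std::size_t i = 0; i < items.size(); ++i) {
        std::map<std::string, unsigned>::iterator it =
            id_for_value.find(items[i].value);
        if (it == id_for_value.end()) {
            // First time this value is seen: open a new, empty cluster.
            unsigned id = static_cast<unsigned>(result.members.size());
            it = id_for_value.insert(std::make_pair(items[i].value, id)).first;
            result.members.push_back(std::vector<std::size_t>());
        }
        result.members[it->second].push_back(i);
        result.cluster_of.push_back(it->second);
    }
    // Every item ends up in exactly one cluster.
    assert(result.cluster_of.size() == items.size());
    return result;
}
```

With a layout like this, iterating over clusters is a walk over members, and looking up the cluster of a single mset item is an index into cluster_of, so both access patterns discussed in this thread stay cheap.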
I believe a cluster name could be added to the core easily if there is a
need. The access methods could look like this:

    citer->set_cluster_name("some_mysterious_cluster");
    citer->get_cluster_name();
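
As a rough illustration of how little the naming would need, here is a sketch of a name table indexed by cluster id. The class ClusterNames is hypothetical and is not part of Xapian. Returning an empty string for an unnamed cluster is an assumption, not a decided behaviour.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical per-cluster name table, indexed by the unsigned cluster id.
class ClusterNames {
    std::vector<std::string> names_;

  public:
    // One (initially empty) name slot per cluster.
    explicit ClusterNames(unsigned cluster_count) : names_(cluster_count) {}

    void set_cluster_name(unsigned id, const std::string& name) {
        assert(id < names_.size());
        names_[id] = name;
    }

    // Returns an empty string for a cluster that was never named.
    const std::string& get_cluster_name(unsigned id) const {
        assert(id < names_.size());
        return names_[id];
    }
};
```

The same table could hold the group's value instead of a name if the clusters came from a "group by value" operation.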
Best,
Yung-chung Lin
On 9/17/07, Richard Boulton <richard at lemurconsulting.com> wrote:
> Olly Betts wrote:
> >> Enquire enq;
> >> ClusterSingleLinkage cluster_algorithm;
> >> enq.set_clustering_method(cluster_algorithm);
> >> MSet matches = enq.get_mset(1, 10);
> >> cout << matches.get_cluster_count() << endl;
> >> for (MSetIterator miter = matches.begin();
> >>      miter != matches.end(); ++miter) {
> >>     cout << "Document " << *miter << " is in cluster "
> >>          << miter->get_cluster_id() << endl;
> >> }
> >>
> >> And let API users do what they want to do with the clusters.
> >
> > Yes, that seems a very nice approach. It also more naturally allows the
> > possibility of using document similarity to eliminate near-duplicates -
> > to do that efficiently you want to do it as matches are generated so
> > that you can stop when you have enough in the MSet.
> >
> > It wouldn't allow generating different clusterings of the same results
> > (without rerunning the search), but that doesn't seem likely to be an
> > annoying limitation.
>
> We've also had the idea of extending the collapse mechanism to group by
> a value (instead of just returning the top document in a collapse group,
> as it currently does). This kind of interface would allow that to be
> represented, too.
>
> There would need to be some way to get a list of the cluster ids
> allocated for a given mset, and probably also a way to get further
> information on a cluster. Some clustering algorithms allow a name to be
> assigned to a cluster, so we should be able to provide that (and if we
> were performing a "group by value" operation instead of a cluster, the
> value for each group should be available).
>
> --
> Richard
>