[Xapian-devel] Document clustering module?

☼ 林永忠 ☼ (Yung-chung Lin) henearkrxern at gmail.com
Sun Sep 16 13:26:05 BST 2007


Hi,

The attached file is my current public clustering interface. I think
the discussion will be easier with a header file to look at.
My clustering module is intended to cluster the documents in an MSet,
and it can enhance query expansion. Clustering is done entirely in
memory. I am not sure whether clustering all the documents in a
database is necessary, since it involves a huge amount of computation.
In-memory clustering of retrieved documents is easier, and I think it
is also useful.

DSet, in the header file, stands for one cluster of documents and
MultiDSet stands for clusters of documents.
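
To make the DSet/MultiDSet shape concrete, here is a minimal sketch of
a single-pass ("leader") clusterer that fills a MultiDSet. It is only
an assumed reading of what ClusterOnePass with a similarity cutoff
might do, not the module's actual algorithm. The Doc stand-in type and
the one_pass_cluster() name are hypothetical, chosen so the sketch
compiles without Xapian.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in: a "document" is just an integer id here.
// In the attached header, DSet holds Xapian::Document objects instead.
typedef int Doc;
typedef std::vector<Doc> DSet;        // one cluster of documents
typedef std::vector<DSet> MultiDSet;  // all clusters

// Single-pass clustering: each document joins the first cluster whose
// first member (its "leader") is at least `cutoff` similar to it;
// otherwise it starts a new cluster of its own.
MultiDSet one_pass_cluster(const std::vector<Doc> &docs,
                           const std::function<double(Doc, Doc)> &similarity,
                           double cutoff)
{
    assert(cutoff >= 0.0);  // assumed precondition for this sketch
    MultiDSet clusters;
    for (std::size_t i = 0; i < docs.size(); ++i) {
        bool placed = false;
        for (std::size_t c = 0; c < clusters.size() && !placed; ++c) {
            if (similarity(clusters[c].front(), docs[i]) >= cutoff) {
                clusters[c].push_back(docs[i]);
                placed = true;
            }
        }
        if (!placed) clusters.push_back(DSet(1, docs[i]));
    }
    return clusters;
}
```

Each document is compared against at most one member per existing
cluster, which is what keeps a one-pass approach cheap enough for the
documents of a single MSet.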

I am using a standalone similarity function,
'calculate_doc_similarity()', which is overridable, so I don't use
Xapian's weighting schemes to calculate weights. (Partly because I
have not read through the Xapian source code yet.) The similarity
measure is based on the vector space model, and API users can simply
supply their own document similarity function. I am not sure if this
is an optimal design. Maybe putting the similarity function into a
class would be even better. It needs discussion.
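
As a starting point for that discussion, here is a rough sketch of the
class-based alternative, using cosine similarity as one common vector
space measure. Everything in it is an assumption for illustration: the
DocSimilarity and CosineSimilarity names are hypothetical, the
TermVector stand-in replaces Xapian::Document so the sketch compiles
on its own, and the attached header does not say which measure
calculate_doc_similarity() actually uses.

```cpp
#include <cmath>
#include <map>
#include <string>

// Hypothetical stand-in for a document: term -> within-document
// frequency. With Xapian, such a vector could be filled by walking
// Document::termlist_begin() and reading each term's wdf.
typedef std::map<std::string, double> TermVector;

// Hypothetical interface: similarity as its own class, instead of a
// virtual method on the clusterer.
class DocSimilarity {
  public:
    virtual ~DocSimilarity() {}
    virtual double operator()(const TermVector &a,
                              const TermVector &b) const = 0;
};

// Cosine of the angle between two term vectors; 0 if either is empty.
class CosineSimilarity : public DocSimilarity {
  public:
    double operator()(const TermVector &a, const TermVector &b) const {
        double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
        for (TermVector::const_iterator i = a.begin(); i != a.end(); ++i) {
            norm_a += i->second * i->second;
            TermVector::const_iterator j = b.find(i->first);
            if (j != b.end()) dot += i->second * j->second;
        }
        for (TermVector::const_iterator j = b.begin(); j != b.end(); ++j)
            norm_b += j->second * j->second;
        if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
        return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
    }
};
```

With this shape, a clusterer would hold a reference to a DocSimilarity
object, so users could swap measures without subclassing every
clustering algorithm.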

For now, I am using MultiDSet to store the clustered documents. I am
wondering whether it would be better to return multiple MSets (a
MultiMSet), but that design would be different and more complicated.

I have read the coding style guidelines in HACKING, so I believe my
coding style is OK. The remaining issues would be scalability and
maintainability.

Comments are welcome.

Best,
Yung-chung Lin

On 9/16/07, Olly Betts <olly at survex.com> wrote:
> On Sun, Sep 16, 2007 at 07:27:34PM +0800, Yung-chung Lin wrote:
> > I am implementing some document clustering algorithms in the xapian
> > core. I would like to know if this kind of module will be considered
> > to be incorporated into the core release.
>
> Yes - I think it fits with xapian-core's role, so the issues are things
> like scalability, maintainability, API consistency, etc.  The "HACKING"
> document in xapian-core has some tips for contributors.
>
> > Or is there already some document clustering module that is just not
> > open-sourced yet?
>
> Not that I'm aware of.
>
> Cheers,
>     Olly
>
-------------- next part --------------
/** \file cluster.h
 * \brief API for clustering retrieved documents
 */
/* Copyright 2007 Yung-chung Lin
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License as
 * published by the Free Software Foundation; either version 2 of the
 * License, or (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
 * 02110-1301 USA
 */

#ifndef XAPIAN_INCLUDED_CLUSTER_H
#define XAPIAN_INCLUDED_CLUSTER_H

#include <string>
#include <vector>
#include <xapian/base.h>
#include <xapian/deprecated.h>
#include <xapian/enquire.h>
#include <xapian/types.h>
#include <xapian/database.h>
#include <xapian/document.h>
#include <xapian/visibility.h>

namespace Xapian {
/// Document set
typedef std::vector<Document> DSet;

/// Multiple document sets
typedef std::vector<DSet> MultiDSet;

/// In-memory document clusterer.
class XAPIAN_VISIBILITY_DEFAULT Cluster {
  protected:
    class Internal;

    /// @internal Reference counted internals.
    Xapian::Internal::RefCntPtr<Internal> internal;

    /// @internal Constructor for internal use.
    explicit Cluster(Internal * i);
    
  public:
    /// Create an empty Xapian::Cluster.
    Cluster();
    
    /// Destroy a Xapian::Cluster.
    virtual ~Cluster();
    
    /// Copying is allowed (and is cheap).
    Cluster(const Cluster & other);

    /// Assignment is allowed (and is cheap).
    void operator=(const Cluster &other);
    
    /// Specify mset
    void set_mset(const MSet &mset);
    
    /// Specify the database being searched.
    void set_database(const Database & db);

    /// Specify the document similarity cutoff used for clustering.
    void set_doc_similarity_cutoff(double cutoff);
    
    /// Document similarity measure
    virtual double calculate_doc_similarity(const Document &a,
                                            const Document &b);
    
    /// Cluster documents
    virtual void cluster() = 0;

    /// Get clustered data
    MultiDSet get_dsets();

    /** Returns a string representing the Cluster.
     *  Introspection method.
     */
    virtual std::string get_description() const;
};

class XAPIAN_VISIBILITY_DEFAULT ClusterPartitional : public Cluster {
  public:
    virtual void cluster() = 0;
    virtual std::string get_description() const = 0;
};

class XAPIAN_VISIBILITY_DEFAULT ClusterOnePass : public ClusterPartitional {
  public:
    void cluster();
    std::string get_description() const;
};

class XAPIAN_VISIBILITY_DEFAULT ClusterHierarchical : public Cluster {
  protected:
    virtual double calculate_dset_similarity(Xapian::DSet &a,
                                             Xapian::DSet &b) = 0;
  public:
    virtual void cluster();
    virtual std::string get_description() const = 0;
};

class XAPIAN_VISIBILITY_DEFAULT ClusterSingleLinkage
    : public ClusterHierarchical {
  private:
    double calculate_dset_similarity(Xapian::DSet &a, Xapian::DSet &b);
  public:
    std::string get_description() const;
};

class XAPIAN_VISIBILITY_DEFAULT ClusterCompleteLinkage
    : public ClusterHierarchical {
  private:
    double calculate_dset_similarity(Xapian::DSet &a, Xapian::DSet &b);
  public:
    std::string get_description() const;
};

class XAPIAN_VISIBILITY_DEFAULT ClusterAverageLinkage
    : public ClusterHierarchical {
  private:
    double calculate_dset_similarity(Xapian::DSet &a, Xapian::DSet &b);
  public:
    std::string get_description() const;
};
    
}

#endif /* XAPIAN_INCLUDED_CLUSTER_H */

