[Xapian-devel] Contribute in the clustering project.

Wed Mar 26 17:32:31 GMT 2014

Hi everyone,

I was not able to submit my proposal due to some last minute error. But,
even then I would love to contribute in the Clustering project.

I had proposed a Scalable and efficent heirarchical clustering method which
was parallelizable on multicore CPUs and clustered upto 150,000 documents
in 2 seconds on a dual core CPU.

Below are the details of the algorithm.... Please let me know if any of
this will help...

The present implementation of the clustering algorithm in the Xapian
library which is a Single Linkage hierarchical clustering lacks in
addressing the problem of larger datasets with limited memory resources.
Hence as the dataset size increases it doesn't scale up to the memory
requirements and the running time. The BIRCH(Balanced Iterative Reducing
and Clustering using Hierarchies) algorithm[1] is a scalable and efficient
clustering algorithm that uses an in-memory data-structure called as
CF-tree(Cluster Feature-tree) which is inspired from the B+ tree and Red
Black tree. This algorithm is easily parallelizable as per mentioned in [2]
and gives even more speed-ups with multi-core CPUs. For 150,000 documents
the speed with a dual-core CPU is just 1.9 seconds as mentioned in [2].

*BIRCH Algorithm: *Based on a semantic similarity of each documents the
distance metric is calculated between each documents. The BIRCH algorithm
involves of forming a CF-tree which is a tree consisting all the entries of
the sub-clusters. Its only the initial scan, that involves the i/o
operations, while forming the sub-clusters; the dense data-points are
clustered together and the sparse ones are not allowed. The later
operations are all in-memory operations.

This makes the BIRCH algorithm faster than any other hierarchical
clustering algorithm in terms of memory operations as well as execution
runtime*. *As the data goes on piling, sub-clusters increase. Complexity of
this algorithm is *O(n**2**). *

*CF-tree : *The CF-tree consists of 3 entries: CF=(N,LS,SS), where N is the
number of data-points, LS is the linear sum of the N data-points i.e. Xi
and SS is the square sum of the N data-points i.e. Xi2 .

Its a height balanced tree with a branching factor B for non-leaf nodes and
L for a leaf node, that depends on the depth of the tree. Also depending on
the page size(decided on the available memory), a threshold T is decided,
and not more than this size, data-points are allowed in the each
sub-cluster. To balance the tree and for the efficient scan, each non-leaf
node contains at the B entries of the form{ CFi ,childi } where i=1,2,...
B, childi is a pointer to its i-th child node and CFi is the CF entry
represented by this child(This can be compared to the B+ tree way of
inserting nodes and balancing them like in the Red-black trees). A leaf
node contains at most L entries and each entry is a CF. In addition, prev
and next pointers are also given to the leaf node to chain its siblings for
an efficient scan.

The CF-tree entry that define the average inter-cluster distance, average
intra-cluster distance and variance increase distance can be referred in
[1]

*Parallel BIRCH: *The parallel BIRCH algorithm is based on the single
Program multiple Data(SPMD) model using message passing. The leaf-nodes
formed in the CF tree can be distributed among the processors and the
processors then communicate with each other to share the inter-cluster
distances. This model is fruitful because there are much less leaf-nodes in
the tree that include many sub-clusters, hence the information shared is
very less and doesn't hamper the performance of the algorithm. The
algorithm terminates when the change in clusters between consecutive
iterations becomes less than some specified threshold.

The main steps of the PBIRCH algorithm are:

1. Data distribution : Initially data is divided equally among all
processors. When new data items arrive they are distributed cyclically to
all the processors.

2. Computing the CF trees concurrently: Initial in memory CF tree is
generated by each processor using local data. If a tree runs out of memory
at any point of

time then it is rebuilt by increasing the threshold .

3. Apply clustering on the leaf nodes: Apply a parallel clustering
algorithm on the CFs stored at the leaf nodes.
Parallel languages like OpenMP or OpenCL can be used to program the CPU
hardware....!

*References:*

[1] Birch: A new Data Clustering algorithm and its Application.

[2] PBIRCH: A Scalable Parallel Clustering algorithm for Incremental Data.

I know that this is an entire algorithm to be implemented, but maybe some
parts of it might be useful in the project you people do.

Cheers,

Shreedhar.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20140326/3b075f95/attachment.html>