<div dir="ltr">Hi everyone,<div> </div><div>I was not able to submit my proposal due to some last minute error. But, even then I would love to contribute in the Clustering project.</div><div><br></div><div>I had proposed a Scalable and efficent heirarchical clustering method which was parallelizable on multicore CPUs and clustered upto 150,000 documents in 2 seconds on a dual core CPU. </div>

<div><br></div><div>Below are the details of the algorithm.... Please let me know if any of this will help...</div><div><br></div><div><br></div><div>The present implementation of

the clustering algorithm in the Xapian library which is a Single Linkage

hierarchical clustering lacks in addressing the problem of larger datasets with

limited memory resources. Hence as the dataset size increases it doesn’t scale

up to the memory requirements and the running time. The BIRCH(Balanced

Iterative Reducing and Clustering using Hierarchies) algorithm[1] is a scalable

and efficient clustering algorithm that uses an in-memory data-structure called

as CF-tree(Cluster Feature-tree) which is inspired from the B+ tree and Red

Black tree. This algorithm is easily parallelizable as per mentioned in [2] and

gives even more speed-ups with multi-core CPUs. For 150,000 documents the speed

with a dual-core CPU is just 1.9 seconds as mentioned in [2].<br></div><div><p></p>

<p><b>BIRCH Algorithm: </b>Based on a semantic similarity of each documents the

distance metric is calculated between each documents. The BIRCH algorithm

involves of forming a CF-tree which is a tree consisting all the entries of the

sub-clusters. Its only the initial scan, that involves the i/o operations,

while forming the sub-clusters; the dense data-points are clustered together

and the sparse ones are not allowed. The later operations are all in-memory

operations. </p>

<p>This

makes the BIRCH algorithm faster than any other hierarchical clustering

algorithm in terms of memory operations as well as execution runtime<i>. </i>As

the data goes on piling, sub-clusters increase. Complexity of this algorithm is

<i>O(n</i><i>2</i><i>). </i></p>

<p><b>CF-tree : </b>The CF-tree consists of 3 entries: CF=(N,LS,SS), where

N is the number of data-points, LS is the linear sum of the N data-points i.e.

Xi and

SS is the square sum of the N data-points i.e. Xi2 . </p>

<p>Its a height balanced tree with

a branching factor B for non-leaf nodes and L for a leaf node, that depends on

the depth of the tree. Also depending on the page size(decided on the available

memory), a threshold T is decided, and not more than this size, data-points are

allowed in the each sub-cluster. To balance the tree and for the efficient

scan, each non-leaf node contains at the B entries of the form{ CFi ,childi } where

i=1,2,... B, childi is a pointer to its i-th child node and CFi is the CF entry

represented by this child(This can be compared to the B+ tree way of inserting

nodes and balancing them like in the Red-black trees). A leaf node contains at

most L entries and each entry is a CF. In addition, prev and next pointers are

also given to the leaf node to chain its siblings for an efficient scan. </p>

<p>The CF-tree entry that define

the average inter-cluster distance, average intra-cluster distance and variance

increase distance can be referred in [1] </p>

<p><b>Parallel BIRCH: </b>The parallel BIRCH algorithm is based on the single

Program multiple Data(SPMD) model using message passing. The leaf-nodes formed

in the CF tree can be distributed among the processors and the processors then

communicate with each other to share the inter-cluster distances. This model is

fruitful because there are much less leaf-nodes in the tree that include many

sub-clusters, hence the information shared is very less and doesn’t hamper the

performance of the algorithm. The algorithm terminates when the change in

clusters between consecutive iterations becomes less than some specified

threshold. </p>

<p>The main steps of the PBIRCH

algorithm are: </p>

<p>1. Data distribution :

Initially data is divided equally among all processors. When new data items

arrive they are distributed cyclically to all the processors. </p>

<p>2. Computing the CF trees

concurrently: Initial in memory CF tree is generated by each processor using

local data. If a tree runs out of memory at any point of </p>

<p>time

then it is rebuilt by increasing the threshold . </p>

<p>3. Apply clustering on the leaf

nodes: Apply a parallel clustering algorithm on the CFs stored at the leaf

nodes. </p></div><div>Parallel languages like OpenMP or OpenCL can be used to program the CPU hardware....!</div><div><br></div><div><p class="MsoNormal"><b><span style="line-height:115%;font-family:'Times New Roman',serif">References:</span></b></p>

<p class="MsoNormal"><span style="line-height:115%;font-family:'Times New Roman',serif">[1] Birch: A new Data Clustering algorithm and its

Application.</span></p>

<p class="MsoNormal"><span style="line-height:115%;font-family:'Times New Roman',serif">[2]</span><b><span style="line-height:115%;font-family:Times-Bold,serif">

</span></b><span style="line-height:115%;font-family:Times-Bold,serif">PBIRCH: A Scalable

Parallel Clustering algorithm for Incremental Data.<font size="4"></font></span></p><p class="MsoNormal"><span style="line-height:115%;font-family:Times-Bold,serif"><br></span></p><p class="MsoNormal"><span style="line-height:115%;font-family:Times-Bold,serif"><br>

</span></p><p class="MsoNormal"><span style="line-height:115%;font-family:Times-Bold,serif"><br></span></p><p class="MsoNormal"><span style="line-height:115%;font-family:Times-Bold,serif">I know that this is an entire algorithm to be implemented, but maybe some parts of it might be useful in the project you people do.</span></p>

<p class="MsoNormal"><span style="line-height:115%;font-family:Times-Bold,serif"><br></span></p><p class="MsoNormal"><span style="line-height:115%;font-family:Times-Bold,serif">Cheers,</span></p><p class="MsoNormal"><span style="line-height:115%;font-family:Times-Bold,serif">Shreedhar.</span></p>

</div></div>