Introduction and Doubts

nirmal singhania nirmal.singhania at st.niituniversity.in
Wed Mar 9 04:57:38 GMT 2016


Hello all, I am Nirmal Singhania from NIIT University, India.
I am interested in the "Clustering of search results" topic.

I have been working in practical machine learning and information
retrieval for quite some time.
I have taken various courses/MOOCs on information retrieval and text mining
and have been working on real-life datasets (KDD99, AWID, MovieLens),
because the problems you face in a real-life ML/IR scenario are different
from what is taught in theory. I am also doing R&D on "Hybrid
Techniques for Intrusion Detection using Data Mining and Clustering on
Newer Datasets".

I have taken an initial look at the docsim folder in xapian-core.
These are my insights:
The clustering used is single-link agglomerative hierarchical clustering.
Its time complexity is O(n^2), where n is the number of documents.
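
To make the quadratic cost concrete, here is a minimal Python sketch of
single-link clustering. It is not the docsim code: the function name, the use
of plain coordinate tuples instead of documents, and the Euclidean distance
are all assumptions made for illustration. It relies on the known equivalence
between single-link clustering and a minimum spanning tree, so every pair of
points is compared once.

```python
import math

def single_link_clusters(points, k):
    """Group points into k clusters by single-link agglomerative clustering.

    Builds a minimum spanning tree with Prim's algorithm (O(n^2) distance
    evaluations), then merges along the n-k shortest tree edges.
    Hypothetical helper for illustration only.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known distance from each point to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []              # (length, u, v) for each tree edge
    for _ in range(n):
        # pick the point closest to the tree that is not yet in it
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u

    # union-find over point indices
    label = list(range(n))

    def find(i):
        while label[i] != i:
            label[i] = label[label[i]]
            i = label[i]
        return i

    edges.sort()
    for _, u, v in edges[: max(n - k, 0)]:
        label[find(u)] = find(v)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

For example, single_link_clusters([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)], 2)
groups the first three points together and the last two together.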
At first, choosing k-means seems to be a viable solution, as k-means has
O(n) time complexity per iteration for a fixed number of clusters.
But it has various shortcomings:
1) The learning algorithm requires a priori specification of the number of
cluster centers.
2) Different initial partitions can result in different final clusters.
3) It does not work well with clusters of different sizes and different
densities.
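
Shortcoming 2 is easy to reproduce. The following is a minimal sketch of the
standard k-means (Lloyd) iterations in Python, assuming points are coordinate
tuples and the caller supplies the initial centers; the names kmeans and sse
are hypothetical helpers, not anything from Xapian.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Run Lloyd iterations from the given initial centers.

    Returns (labels, centers). The number of clusters is fixed by how many
    initial centers the caller passes in (shortcoming 1).
    """
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest center
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # update step: move each center to the mean of its members
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers

def sse(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[l]) ** 2 for p, l in zip(points, labels))
```

With the four points (0,0), (0,1), (4,0), (4,1), starting from the centers
(0,0) and (4,0) gives the labels [0, 0, 1, 1], while starting from (0,0) and
(0,1) converges to [0, 1, 0, 1], which has a much larger squared error.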
After that, we can think of k-means++.
The *k*-means++ algorithm addresses the second of these obstacles by
specifying a procedure to initialize the cluster centers before proceeding
with the standard *k*-means optimization iterations (the number of clusters
still has to be given up front).
But it is a little slower due to the extra cluster initialization step.
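
As a rough illustration of that initialization step, here is a minimal Python
sketch of k-means++ seeding, assuming points are coordinate tuples. The name
kmeans_pp_init and the optional rng parameter are hypothetical. Each new
center needs one more pass over all points, which is where the extra cost
comes from.

```python
import math
import random

def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centers with k-means++ seeding.

    The first center is chosen uniformly at random; each later center is
    chosen with probability proportional to its squared distance from the
    nearest center picked so far.
    """
    rng = rng or random.Random()
    centers = [rng.choice(points)]
    # squared distance from every point to its nearest chosen center
    d2 = [math.dist(p, centers[0]) ** 2 for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # every point coincides with a chosen center; fall back to uniform
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
        c = centers[-1]
        d2 = [min(old, math.dist(p, c) ** 2) for old, p in zip(d2, points)]
    return centers
```

The returned centers can then be used as the starting point for the usual
k-means iterations.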
Then we can think of bisecting k-means, which is better than
k-means. The bisecting k-means algorithm is a divisive hierarchical
clustering algorithm.
It is a little faster than the original k-means, but its clustering results
are poorer than those of hierarchical agglomerative clustering,
based on various metrics of cluster quality such as
entropy, F-measure, overall similarity, relative margin and variance ratio.
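
As an example of one of these metrics, here is a minimal Python sketch of
clustering entropy. It assumes each document has a known reference class to
compare against, which is my assumption for the illustration; the function
name clustering_entropy is hypothetical.

```python
import math
from collections import Counter

def clustering_entropy(cluster_labels, true_classes):
    """Size-weighted average entropy (in bits) of the class mix per cluster.

    0.0 means every cluster contains a single class; larger values mean
    the clusters mix classes more.
    """
    n = len(cluster_labels)
    by_cluster = {}
    for c, t in zip(cluster_labels, true_classes):
        by_cluster.setdefault(c, []).append(t)
    total = 0.0
    for members in by_cluster.values():
        size = len(members)
        counts = Counter(members)
        # entropy of the class distribution inside this cluster
        h = -sum((m / size) * math.log2(m / size) for m in counts.values())
        total += (size / n) * h
    return total
```

A clustering that separates two classes perfectly scores 0.0, while one that
puts half of each class into each of two clusters scores 1.0 bit.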

Based on the research I have done so far, I have in mind a clustering
algorithm that can overcome the quality issues of k-means (and its variants)
and the speed issues of hierarchical agglomerative clustering (HAC).
Theoretically it can run in O(n) time and can produce better results than
HAC based on various metrics.
I can't discuss it on the mailing list, but if you say so, we can discuss it
and its implementation in Xapian further in PM.

Thank you for your time.






Regards,
Nirmal Singhania
B.tech III Yr