<div dir="ltr"><span style="font-size:12.8px">Sorry for the late reply; I was not well. </span><div style="font-size:12.8px"><br><div>BM25 is a ranking measure based on a probabilistic model.</div><div><br></div><div>But in CLUBS we compute distances between document vectors in the vector space model.</div></div><div style="font-size:12.8px">So BM25 doesn't seem to make sense in the vector space model. (Correct me if I am wrong:</div><div style="font-size:12.8px">is there such a thing as BM25 vectors and a distance between them?)</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">BM25 is very useful for ranking documents retrieved by search queries.</div><div style="font-size:12.8px">But for representing documents in the vector space model, TF-IDF seems appropriate.</div><div style="font-size:12.8px">From the TF-IDF vectors, we first calculate the distance between documents based on cosine similarity, and optimize the SSQ error until we get our final clusters.</div><div style="font-size:12.8px">After we get our clusters, we apply Euclidean similarity to further improve the clustering. (I have modified the original CLUBS to incorporate both cosine similarity and Euclidean similarity, as both have their pros and cons.) </div><div style="font-size:12.8px">Correct me if I am wrong.</div><div style="font-size:12.8px">We will use the clustering results when the user clicks on similar results.<br></div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Also, to present the clustering results in groups, we will incorporate keyword extraction techniques such as KEA, TextRank and RAKE to get keywords from the search documents, giving something like this (courtesy of Carrot2):</div><div style="font-size:12.8px"><br></div><div><span style="font-size:12.8px"><a href="http://project.carrot2.org/img/carrot2-demo-screenshot.gif">http://project.carrot2.org/img/carrot2-demo-screenshot.gif</a></span><br></div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">And as 
you know, CLUBS is a variant of hierarchical clustering, and hierarchical clustering is known for its accuracy because it approaches the problem the way a human would.</div><div style="font-size:12.8px">Its speed is also better than that of the algorithms best known for speed (k-means, k-means++).</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">The steps to implement it are:</div><div style="font-size:12.8px">1) The user types a search.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">2) Xapian outputs the results based on the default BM25 scheme.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">3) The output of the search is fed into the clustering module. </div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">4) Documents are divided into clusters by the CLUBS algorithm.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">5) Each document cluster is given a category based on keyword extraction.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">6) Results are given back to the user ranked by the default BM25 scheme (no change here), with each result having a link to similar results.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">7) Similar results will be the clustering results, ranked by BM25 relevance to the search result whose similar results we want. 
Also, we compute the Euclidean similarity measure after clustering, which we can use to rank similar results.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">8) The results page also has a section grouping the returned documents by keyword/category.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">One thing we can do is show the user the search results and perform the clustering alongside.</div><div style="font-size:12.8px">This will work better for people who are interested in the search results and not in the clustering.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">The modules can include:</div><div style="font-size:12.8px">1) Cluster(DocVector[] &V): returns the clusters (a hashmap with the cluster number as key and the documents belonging to that cluster as values).</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">2) DocVector(MSet &M): returns an array of TF-IDF vectors built from the search result documents, one vector per document.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">3) EuclideanSim(DocVector &V1,DocVector &V2): returns the similarity between two document vectors.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">4) KeywordExtract(Cluster &C1): returns a keyword string and assigns it to the cluster.</div><div style="font-size:12.8px">That keyword is shown as the title, and clicking it returns all the documents in the cluster.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">5) GetSimilar(Document &D): returns ranked similar documents based on the clustering.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">The dependencies will be Xapian's MSet and BM25Weight.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">The accuracy of the clustering will be measured on various document clustering datasets and 
will be improved with respect to the measures given previously.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">One suggestion I would like is whether to use cosine similarity or Euclidean similarity first, as there is no clear-cut explanation of which is better.</div><div style="font-size:12.8px">Based on your experience, you might be the best person to guide me on this.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Please tell me how much detail I should add to each part about implementation and methodology.</div><div style="font-size:12.8px">I have to improve it a lot. <br></div><div style="font-size:12.8px">Please give your suggestions.</div><div style="font-size:12.8px">Have a nice day.</div><div style="font-size:12.8px"><br></div><div><div class="gmail_signature"><div dir="ltr"><div dir="ltr"><div dir="ltr">Regards,<div>Nirmal Singhania</div><div>B.tech III Yr</div></div></div></div></div></div></div><div class="gmail_extra"><br clear="all">
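<div style="font-size:12.8px"><br></div><div style="font-size:12.8px">A minimal sketch of the vector-space step described above, in plain Python rather than Xapian's C++ API: it builds sparse TF-IDF vectors for tokenised documents and compares them with cosine similarity and Euclidean distance. The function names, the smoothed IDF formula and the toy documents are illustrative assumptions; this is not CLUBS itself and not the proposed module interface.</div>
<pre>
```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per tokenised document."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF; an assumption, any IDF variant could be swapped in.
        vectors.append({t: tf[t] * math.log(1 + n_docs / df[t]) for t in tf})
    return vectors


def normalise(vec):
    """Scale a sparse vector to unit Euclidean length (zero vectors stay zero)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return dict(vec)
    return {t: w / norm for t, w in vec.items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors."""
    a, b = normalise(a), normalise(b)
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def euclidean_distance(a, b):
    """Euclidean distance between two sparse vectors."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


# Toy "search results" (hypothetical data, already tokenised).
docs = [
    "xapian search engine library".split(),
    "xapian search ranking bm25".split(),
    "hierarchical clustering of documents".split(),
]
vecs = [normalise(v) for v in tfidf_vectors(docs)]

# Print both measures for every pair of documents.
for i in range(len(vecs)):
    for j in range(i + 1, len(vecs)):
        print(i, j,
              round(cosine_similarity(vecs[i], vecs[j]), 3),
              round(euclidean_distance(vecs[i], vecs[j]), 3))
```
</pre>
<div style="font-size:12.8px">For vectors normalised to unit length, the squared Euclidean distance equals 2(1 - cosine similarity), so on normalised TF-IDF vectors both measures rank neighbours in the same order; the choice between them only matters when vector lengths are kept.</div>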

<br><div class="gmail_quote">On Fri, Mar 11, 2016 at 5:48 PM, James Aylett <span dir="ltr"><<a href="mailto:james-xapian@tartarus.org" target="_blank">james-xapian@tartarus.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Fri, Mar 11, 2016 at 01:21:14AM +0530, nirmal singhania wrote:<br>

<br>

> Tf-idf is the most widely used weighting scheme, is easy to understand, and has<br>

> been used in other frameworks like lucene and many other places.<br>

> okapi bm25(implemented in xapian) is theoretically better/improved measure<br>

> than tf-idf<br>

<br>

</span>Okay, so doesn't that suggest using BM25 instead of tf-idf? Or even<br>

making it configurable, since Xapian already has an abstraction for<br>

weighting schemes, so the user can plug in whatever they want (with a<br>

sensible default)?<br>

<span class=""><br>

> i am looking into various other weighting scheme which are there in<br>

> xapian or can be implemented like TF-ICF(term frequency inverse<br>

> corpus frequency),TF-RF(term frequency-relevance frequency)<br>

<br>

</span>If there's a useful weighting scheme to add for clustering that Xapian<br>

doesn't support, that could be a useful 'warmup' piece of work, before<br>

the main project starts, to help you get used to developing Xapian.<br>

<span class=""><br>

> for evaluating the speed and accuracy of final clustering system we<br>

> can benchmark it against various other algos like k-means,HAC based<br>

> on the measures mentioned in previous<br>

> mail.(purity,F-measure,Entropy,F-Measure,Overall Similarity,Relative<br>

> Margin,Variance Ratio)<br>

<br>

</span>Great. Sounds like you have lots of helpful detail for your proposal<br>

on this :-)<br>

<div class="HOEnZb"><div class="h5"><br>

J<br>

<br>

--<br>

  James Aylett, occasional trouble-maker<br>

  <a href="http://xapian.org" rel="noreferrer" target="_blank">xapian.org</a><br>

</div></div></blockquote></div><br></div>