Introduction and Doubts

nirmal singhania nirmal.singhania at st.niituniversity.in
Tue Mar 15 18:06:25 GMT 2016


Sorry for the late reply; I was not well.

BM25 is a ranking measure based on a probabilistic model,
but in CLUBS we compute distances between document vectors in the vector
space model, so BM25 does not seem to make sense there. (Correct me if I am
wrong: is there such a thing as BM25 vectors and a distance between them?)

BM25 is very useful for ranking documents retrieved by search queries,
but for representing documents in the vector space model TF-IDF seems more
appropriate.
From the TF-IDF vectors, we first calculate the distance between documents
using cosine similarity, optimizing the SSQ error until we get our
final clusters.
Once we have the clusters, we apply Euclidean similarity to improve the
clustering further. (I have modified the original CLUBS to incorporate both
cosine similarity and Euclidean similarity, as both have their pros and
cons.)
Correct me if I am wrong.
We will use the clustering results when the user clicks on similar results.
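
To make the vector space part concrete, here is a minimal sketch in plain Python (not Xapian code). It assumes documents are already tokenized, uses one common TF-IDF variant (relative term frequency times log(N/df)), and assumes "SSQ error" means the sum of squared Euclidean distances from each vector to its cluster centroid. The function names are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ssq_error(vectors, labels):
    """Sum of squared distances from each vector to the centroid of its cluster."""
    clusters = {}
    for vec, label in zip(vectors, labels):
        clusters.setdefault(label, []).append(vec)
    total = 0.0
    for members in clusters.values():
        terms = set().union(*members)
        centroid = {t: sum(v.get(t, 0.0) for v in members) / len(members) for t in terms}
        for v in members:
            total += sum((v.get(t, 0.0) - centroid[t]) ** 2 for t in terms)
    return total
```

A clustering that puts every document in its own cluster has an SSQ error of zero, so the measure is only meaningful when comparing clusterings with the same number of clusters.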

Also, to present the clustering results in groups, we will incorporate
keyword extraction techniques such as KEA, TextRank and RAKE to get keywords
from the search documents, giving something like this (courtesy of Carrot2):

http://project.carrot2.org/img/carrot2-demo-screenshot.gif

As you know, CLUBS is a variant of hierarchical clustering, and
hierarchical clustering is known for its accuracy, since it approaches
the problem the way a human would.
The speed of CLUBS is also better than that of the algorithms best known
for speed (k-means, k-means++).

The steps to implement it are:
1) The user types a search.

2) Xapian outputs the results based on the default BM25 scheme.

3) The output of the search is fed into the clustering module.

4) Documents are divided into clusters by the CLUBS algorithm.

5) Each document cluster is given a category based on keyword extraction.

6) Results are given back to the user ranked by the default BM25 scheme (no
change here), with each result having a link to similar results.

7) Similar results will be the clustering results, ranked
according to BM25 relevance to the search result whose similar results we
want. We also compute a Euclidean similarity measure after clustering, which
we can use to rank similar results.

8) The results page also has a section grouping the returned documents
by keyword/category.
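
The steps above can be sketched roughly as follows, in plain Python rather than against the Xapian API. Everything here is an assumption for illustration: `build_results_page` is a made-up name, `cluster_fn` is a stand-in for CLUBS, and picking the most frequent term is only a crude placeholder for KEA/TextRank/RAKE. The similar results are simply left in the query's BM25 order, which simplifies step 7.

```python
from collections import Counter, defaultdict


def build_results_page(hits, cluster_fn):
    """hits: list of (doc_id, bm25_score, terms), already in BM25 order (steps 1-2).
    cluster_fn: callable taking the list of term lists and returning one
    cluster label per hit (placeholder for CLUBS, steps 3-4)."""
    labels = cluster_fn([terms for _, _, terms in hits])
    members = defaultdict(list)
    for hit, label in zip(hits, labels):
        members[label].append(hit)

    # Step 5: name each cluster by its most frequent term (placeholder
    # for a real keyword extraction technique).
    names = {}
    for label, group in members.items():
        counts = Counter(t for _, _, terms in group for t in terms)
        names[label] = counts.most_common(1)[0][0] if counts else "cluster %s" % label

    # Steps 6-7: keep the BM25 order and attach the other members of the
    # same cluster as "similar" results.
    page = []
    for (doc_id, score, _), label in zip(hits, labels):
        similar = [d for d, _, _ in members[label] if d != doc_id]
        page.append({"doc": doc_id, "score": score, "similar": similar})

    # Step 8: one (keyword, documents) group per cluster.
    groups = [(names[label], [d for d, _, _ in group]) for label, group in members.items()]
    return page, groups
```

Keeping the clustering behind a plain callable means the ranked list never depends on it, which fits the idea that BM25 ranking stays unchanged.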

One thing we can do is show the user the search results and perform the
clustering in parallel.
This will make it better for people interested in the search results and not
in clustering.



The modules can include:
1) Cluster(DocVector[] &V): returns the clusters (a hashmap with the cluster
number as key and the documents belonging to that cluster as values)

2) DocVector(MSet &M): returns an array of TF-IDF vectors from the search
result documents, one vector per document

3) EuclideanSim(DocVector &V1, DocVector &V2): returns the similarity between
two document vectors

4) KeywordExtract(Cluster &C1): returns a string keyword and assigns it to
the cluster.
That keyword is used as the title, and clicking it returns all the documents
in the cluster.

5) GetSimilar(Document &D): returns ranked similar documents based on the
clustering
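
As a rough illustration of how modules 3 and 5 could fit together, here is a sketch in plain Python. The data shapes are assumptions: vectors are sparse dicts keyed by term, clusters are a dict from cluster number to a list of document ids, and turning the Euclidean distance into a similarity with 1/(1+d) is just one possible choice. None of this is existing Xapian API.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two sparse vectors (dict: term -> weight)."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def euclidean_sim(a, b):
    """Map a distance in [0, inf) to a similarity in (0, 1]; identical vectors give 1.0."""
    return 1.0 / (1.0 + euclidean_distance(a, b))


def get_similar(doc_id, clusters, vectors):
    """Return the other documents in doc_id's cluster, most similar first.

    clusters: dict cluster number -> list of doc ids
    vectors:  dict doc id -> sparse vector
    """
    for members in clusters.values():
        if doc_id in members:
            others = [d for d in members if d != doc_id]
            return sorted(
                others,
                key=lambda d: euclidean_sim(vectors[doc_id], vectors[d]),
                reverse=True,
            )
    return []  # unknown document: nothing to suggest
```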

The dependencies will be on Xapian's MSet and BM25Weight.

The accuracy of the clustering will be measured on various document
clustering datasets and improved according to the measures given previously.
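
As an example of one of those measures, purity can be computed as below. This is a minimal sketch assuming a labelled dataset where each document has one true class; the function name is made up for illustration.

```python
from collections import Counter


def purity(predicted, truth):
    """Fraction of documents that fall in the majority true class of their cluster.

    predicted: cluster label per document
    truth:     true class label per document (same order)
    """
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not predicted:
        return 0.0
    by_cluster = {}
    for p, t in zip(predicted, truth):
        by_cluster.setdefault(p, Counter())[t] += 1
    # For each cluster, count only the documents of its most common true class.
    return sum(max(c.values()) for c in by_cluster.values()) / len(predicted)
```

Purity rises as the number of clusters grows (one cluster per document gives 1.0), so it is best read together with the other measures.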

One suggestion I would like is whether to use cosine similarity or
Euclidean similarity first, as there is no clear-cut explanation of which is
better.
Based on your experience, you might be the best person to guide me on this.
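
One relation that may be relevant here: for vectors scaled to unit length, the squared Euclidean distance equals 2 * (1 - cosine similarity), so both measures rank neighbours identically and differ only when vector length is allowed to matter. A small sketch to check this numerically (helper names are made up for illustration):

```python
import math


def normalize(v):
    """Scale a dense vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def squared_euclidean(a, b):
    """Squared Euclidean distance of two dense vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
lhs = squared_euclidean(normalize(a), normalize(b))
rhs = 2.0 * (1.0 - cosine(a, b))
print(abs(lhs - rhs) < 1e-12)
```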

Please tell me how much detail I should add to each part about
implementation and methodology.
I have to improve a lot on it.
Please give your suggestions.
Have a nice day.

Regards,
Nirmal Singhania
B.tech III Yr


On Fri, Mar 11, 2016 at 5:48 PM, James Aylett <james-xapian at tartarus.org>
wrote:

> On Fri, Mar 11, 2016 at 01:21:14AM +0530, nirmal singhania wrote:
>
> > Tf-idf is most used used weighting scheme is easy to understand and has
> > been used in other frameworks like lucene and many other places.
> > okapi bm25(implemented in xapian) is theoretically better/improved
> > measure than tf-idf
>
> Okay, so doesn't that suggest using BM25 instead of tf-idf? Or even
> making it configurable, since Xapian already has an abstraction for
> weighting schemes, so the user can plug in whatever they want (with a
> sensible default)?
>
> > i am looking into various other weighting scheme which are there in
> > xapian or can be implemented like TF-ICF(term frequecy inverse
> > corpus frequency),TF-RF(term frequency-relevance frequency)
>
> If there's a useful weighting scheme to add for clustering that Xapian
> doesn't support, that could be a useful 'warmup' piece of work, before
> the main project starts, to help you get used to developing Xapian.
>
> > for evaluating the speed and accuracy of final clustering system we
> > can benchmark it against various other algos like k-means,HAC based
> > on the measures mentioned in previous
> > mail.(purity,F-measure,Entropy,F-Measure,Overall Similarity,Relative
> > Margin,Variance Ratio)
>
> Great. Sounds like you have lots of helpful detail for your proposal
> on this :-)
>
> J
>
> --
>   James Aylett, occasional trouble-maker
>   xapian.org
>