clustering technique using lsi

MURTUZA BOHRA murtuzabohra88 at gmail.com
Wed Mar 23 11:33:42 GMT 2016


The second reference in my proposal is a research paper that uses the
Vector Space Model to cluster search results. My proposed algorithm is
based on it, with a slight modification. The paper first uses the vector
space model to find popular phrases in the documents, which serve as labels
for the different clusters, and then uses LSI to find the relevant
documents for each cluster label. My algorithm does the same thing, but
instead of finding popular phrases it uses the search result documents
themselves to form the clusters, with the aim of getting better results.
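
A minimal sketch of the seed-and-ring procedure discussed in this thread,
in Python. It assumes the documents have already been projected into the
reduced LSI space, so it takes plain vectors as input and does not perform
the SVD step. The function names, the cosine cut-off of 0.8, and the sample
vectors and scores are all hypothetical and are not taken from the proposal.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def seed_clusters(vectors, scores, threshold=0.8):
    """Greedy clustering: repeatedly seed on the best remaining document.

    vectors   -- document vectors, assumed to be in LSI space already
    scores    -- relevance score per document (higher means more relevant)
    threshold -- assumed cosine cut-off that defines the "ring" around a seed
    """
    # Visit documents in order of decreasing relevance score.
    remaining = sorted(range(len(vectors)), key=lambda i: -scores[i])
    clusters = []
    while remaining:
        seed = remaining[0]
        # The seed always belongs to its own cluster, so the loop terminates.
        members = [seed] + [
            i for i in remaining[1:]
            if cosine(vectors[seed], vectors[i]) >= threshold
        ]
        clusters.append(members)
        remaining = [i for i in remaining if i not in members]
    return clusters


# Hypothetical 2-dimensional "LSI" vectors and relevance scores.
docs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9), (-1.0, 0.0)]
scores = [0.9, 0.5, 0.8, 0.4, 0.1]
clusters = seed_clusters(docs, scores)
print(clusters)
```

Each cluster is a list of document indices with the seed first. A document
that is not close to any earlier seed ends up seeding its own cluster, which
is how the procedure exhausts the search results.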

On Wed, Mar 23, 2016 at 4:45 PM, MURTUZA BOHRA <murtuzabohra88 at gmail.com>
wrote:

> I think I should explain the proposed algorithm in the proposal more
> clearly. I did not do that because I thought it would make the proposal
> too lengthy. Is there a word limit for the proposal?
>
> On Wed, Mar 23, 2016 at 4:40 PM, MURTUZA BOHRA <murtuzabohra88 at gmail.com>
> wrote:
>
>> Hello sir,
>>
>> You have interpreted correctly that clustering will be done by generating
>> a ring around a document (i.e. the basic idea of LSI). But it does not
>> work by increasing the radius so that each successive shell becomes
>> another cluster. Rather, it picks one document (based on relevance score)
>> and forms a ring around it to cluster the documents. Then, from the
>> remaining documents (those in the search results but not yet in a
>> cluster), another document is picked and the next cluster is formed. This
>> goes on until all the search results are exhausted.
>>
>> I have attached a file that illustrates the algorithm geometrically;
>> please have a look at it.
>>
>> On Wed, Mar 23, 2016 at 12:21 AM, Olly Betts <olly at survex.com> wrote:
>>
>>> On Tue, Mar 22, 2016 at 02:08:23PM +0530, MURTUZA BOHRA wrote:
>>> > How would latent semantic indexing help?
>>> >
>>> > In LSI we project the query (treated as a pseudo-document) onto the
>>> > term-document vector space and, based on some threshold, find the
>>> > relevant documents. Similarly, if we use LSI for clustering, we can
>>> > take one of our search results instead of a query, set several
>>> > thresholds, and cluster the search results in a single shot, one
>>> > cluster per threshold.
>>>
>>> So if I follow, you take one document (how do you decide which) and then
>>> generate a set of clusters as (multi-dimensional) rings around it of
>>> increasing radius?
>>>
>>> That doesn't sound like it's going to do a good job of producing useful
>>> clusters.  The group around the "seed" document is probably related,
>>> but once you get beyond that the documents in the cluster are defined
>>> only by distance from the seed.
>>>
>>> In geographical terms, locations which are < 10km from a given point
>>> might be a useful cluster, but locations between 10 and 20km from that
>>> point are much less likely to be.
>>>
>>> Cheers,
>>>     Olly
>>>
>>
>>
>
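
The geographical objection quoted above can be checked with a small
numerical sketch. The coordinates below are made up for illustration, with
the seed at the origin and distances in km. The sketch compares the largest
distance between any two members of the inner ball (< 10km from the seed)
with the largest distance between any two members of the 10 to 20km shell.

```python
import math
from itertools import combinations


def diameter(points):
    """Largest pairwise distance within a group (0.0 for fewer than 2 points)."""
    return max((math.dist(p, q) for p, q in combinations(points, 2)), default=0.0)


seed = (0.0, 0.0)
# Hypothetical locations, chosen only to illustrate the point.
points = [(3.0, 4.0), (-6.0, 0.0), (0.0, -8.0),
          (15.0, 0.0), (-15.0, 0.0), (0.0, 18.0)]

ball = [p for p in points if math.dist(seed, p) < 10]
shell = [p for p in points if 10 <= math.dist(seed, p) < 20]

print("ball diameter:", diameter(ball))
print("shell diameter:", diameter(shell))
```

Two points in the ball can never be more than 20km apart, whereas two points
in the shell can sit on opposite sides of the seed, up to 40km apart. Shell
members share only their distance from the seed, not closeness to each other.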