[Xapian-discuss] Document snippet generation

Wed Mar 19 19:37:12 GMT 2008

Hi Kevin

I did attach the source code to the original posting, but it seems not
to have made it through the mailing list. You can download it from the
link below. I am using it on our company search, where it's doing a
good job and is pretty fast. It needs a bit of tidying up, and my C++
knowledge is very weak, so I could do with some help.

I will do some reading on the link you sent, thanks.

http://www.cbell.info/XapSum.zip

Regards

Colin

On 19 Mar 2008, at 18:29, Kevin Duraj wrote:

> On Tue, Mar 18, 2008 at 7:15 AM, Colin Bell <colinabell at gmail.com>  
> wrote:
>> Hi All
>>
>> Following on from a discussion that was flying around a while back
>> about document snippets (summaries), I have knocked together some
>> proof of concept code (C++) that uses the Xapian stemming ability and
>> sentence extraction (see http://en.wikipedia.org/wiki/Sentence_extraction).
>> I also used the Open Text Summarizer project as an inspiration.
>>
>> It works quite well, but has some caveats which are explained in the
>> code comments. It can summarise, highlight sentences and highlight
>> words. It also has the ability to do context summaries: for example,
>> if you supply it with terms, it will summarise the text within the
>> context of those terms.
>>
>> I am new to C++ programming, so while you're laughing out loud at the
>> poor coding, please keep that in mind. The code was assembled on
>> Ubuntu Linux and comes with a Makefile. I have also supplied my
>> stopper class. For some reason the stopper still fails to stop some
>> of the words in its stop list (like "the"). If anyone knows why,
>> please let me know.
>>
>> Feedback / comments / changes / improvements are more than welcome -
>> bring it on. I really hope this sparks an interest.
>>
>> Regards
>>
>> Colin
>>
>
> Colin!
>
> Great job, it definitely sparks an interest. Can you share the code
> with us, or send a link where we can download it? I will run the
> sentence summarizer against the myhealthcare.com search engine, which
> indexes 73 million documents, and we will see what kind of results
> come out on top. Hopefully, we will get rid of web sites that use
> excessive keyword stuffing and spamdexing techniques.
>
> Did you have a chance to take a look at the Flesch-Kincaid readability
> algorithm, designed to measure comprehension difficulty in the English
> language?
> http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test
>
> Kevin Duraj
> http://myhealthcare.com
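
For readers following the thread without the zip file, here is a minimal
sketch of what a context summary by sentence extraction could look like.
It is not Colin's XapSum code. Every name in it (normalise,
split_sentences, score_sentence, summarise) is hypothetical, the scoring
rule (count the query-term hits in each sentence) is an assumption, and
normalise() only lowercases words as a stand-in for a real stemmer such
as Xapian::Stem.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for stemming: keep letters/digits and lowercase them.
// A real version might call Xapian::Stem("english") on each word instead.
std::string normalise(const std::string& word) {
    std::string out;
    for (unsigned char c : word)
        if (std::isalnum(c)) out += static_cast<char>(std::tolower(c));
    return out;
}

// Naive sentence splitter: a sentence ends at '.', '!' or '?'.
// Abbreviations such as "e.g." will be split incorrectly.
std::vector<std::string> split_sentences(const std::string& text) {
    std::vector<std::string> sentences;
    std::string current;
    for (char c : text) {
        // Skip whitespace between sentences.
        if (current.empty() && std::isspace(static_cast<unsigned char>(c))) continue;
        current += c;
        if (c == '.' || c == '!' || c == '?') {
            sentences.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) sentences.push_back(current);
    return sentences;
}

// Score = number of words in the sentence that match a (normalised) term.
std::size_t score_sentence(const std::string& sentence,
                           const std::set<std::string>& terms) {
    std::size_t hits = 0;
    std::string word;
    for (std::size_t i = 0; i <= sentence.size(); ++i) {
        unsigned char c = i < sentence.size()
            ? static_cast<unsigned char>(sentence[i]) : ' ';
        if (std::isalnum(c)) { word += static_cast<char>(c); continue; }
        if (!word.empty() && terms.count(normalise(word))) ++hits;
        word.clear();
    }
    return hits;
}

// Return up to max_sentences sentences that mention the query terms,
// picking the highest scoring ones and keeping their original order.
std::vector<std::string> summarise(const std::string& text,
                                   const std::set<std::string>& query_terms,
                                   std::size_t max_sentences) {
    assert(max_sentences > 0);
    typedef std::pair<std::size_t, std::size_t> Ranked;  // (score, index)

    std::set<std::string> terms;
    for (const std::string& t : query_terms) terms.insert(normalise(t));

    const std::vector<std::string> sentences = split_sentences(text);
    std::vector<Ranked> ranked;
    for (std::size_t i = 0; i < sentences.size(); ++i) {
        const std::size_t score = score_sentence(sentences[i], terms);
        if (score > 0) ranked.push_back(Ranked(score, i));
    }
    // Best scores first; ties keep document order.
    std::stable_sort(ranked.begin(), ranked.end(),
                     [](const Ranked& a, const Ranked& b) { return a.first > b.first; });
    if (ranked.size() > max_sentences) ranked.resize(max_sentences);
    // Restore document order so the summary reads naturally.
    std::sort(ranked.begin(), ranked.end(),
              [](const Ranked& a, const Ranked& b) { return a.second < b.second; });

    std::vector<std::string> summary;
    for (const Ranked& r : ranked) summary.push_back(sentences[r.second]);
    return summary;
}
```

Calling summarise(text, terms, 2) with the terms "search" and "stemming"
would return the two sentences that mention those words most often, in
the order they appear in the text. A stop list would slot in naturally
inside score_sentence(), by skipping any normalised word found in it.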