[Xapian-discuss] Document snippet generation

Kevin Duraj kevin.softdev at gmail.com
Wed Mar 19 18:29:31 GMT 2008


On Tue, Mar 18, 2008 at 7:15 AM, Colin Bell <colinabell at gmail.com> wrote:
> Hi All
>
> Following on from a discussion that was flying around a while back
> about document snippets (summaries). I have knocked together some
> proof of concept code (C++) that uses the Xapian stemming ability and
> sentence extraction (see http://en.wikipedia.org/wiki/Sentence_extraction).
> I also used the Open Text Summarizer project as an inspiration.
>
> It works quite well, but has some caveats which are explained in the
> code comments. It can summarise, highlight sentences and highlight
> words. It also has the ability to do context summaries. For example,
> if you supply it with terms, it will summarise the text within the
> context of those terms.
>
> I am new to C++ programming, so while you're laughing out loud at the
> poor coding, please keep that in mind. The code was assembled on
> Ubuntu Linux and comes with a Makefile. I have also supplied my
> stopper class. For some reason the stopper still fails to stop some of
> the words in the stop list (like "the"). If anyone knows why, please
> let me know.
>
> Feedback / comments / changes / improvements are more than welcome -
> bring it on. I really hope this sparks an interest.
>
> Regards
>
> Colin
>

Colin!

Great job, it definitely sparks an interest. Can you share the code with us, or send a link where we can download it? I will run the sentence summarizer against the myhealthcare.com search engine (73 million documents), and we will see what kind of results come out on top. Hopefully, it will help us get rid of web sites that use excessive keyword stuffing and spamdexing techniques.
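
For anyone following the thread, here is a rough sketch of the term-context idea Colin describes: split the text into sentences, score each sentence by how many of the supplied terms it contains, and keep the best ones in their original order. This is not Colin's code. The function names (trim, split_sentences, score_sentence, summarise) are made up for illustration, and plain lowercasing stands in for stemming so the sketch has no dependencies. With Xapian available, each word could presumably be passed through Xapian::Stem instead.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Strip leading and trailing whitespace.
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Lowercase a word. Assumption: this stands in for real stemming.
std::string normalise(const std::string& w) {
    std::string out;
    for (unsigned char c : w) out += static_cast<char>(std::tolower(c));
    return out;
}

// Naive splitter: a sentence ends at '.', '!' or '?'.
// Abbreviations such as "e.g." will be split incorrectly.
std::vector<std::string> split_sentences(const std::string& text) {
    std::vector<std::string> out;
    std::string cur;
    for (char c : text) {
        cur += c;
        if (c == '.' || c == '!' || c == '?') {
            std::string t = trim(cur);
            if (!t.empty()) out.push_back(t);
            cur.clear();
        }
    }
    std::string t = trim(cur);
    if (!t.empty()) out.push_back(t);
    return out;
}

// Count how many words of the sentence appear in the term set.
int score_sentence(const std::string& sentence,
                   const std::set<std::string>& terms) {
    int score = 0;
    std::string word;
    for (std::size_t i = 0; i <= sentence.size(); ++i) {
        unsigned char c = i < sentence.size()
            ? static_cast<unsigned char>(sentence[i]) : ' ';
        if (std::isalnum(c)) {
            word += static_cast<char>(std::tolower(c));
        } else if (!word.empty()) {
            if (terms.count(word)) ++score;
            word.clear();
        }
    }
    return score;
}

// Return up to max_sentences matching sentences, in document order.
std::vector<std::string> summarise(const std::string& text,
                                   const std::vector<std::string>& query_terms,
                                   std::size_t max_sentences) {
    std::set<std::string> terms;
    for (const std::string& t : query_terms) terms.insert(normalise(t));

    std::vector<std::string> sentences = split_sentences(text);
    std::vector<std::pair<int, std::size_t>> ranked;  // (score, index)
    for (std::size_t i = 0; i < sentences.size(); ++i) {
        int s = score_sentence(sentences[i], terms);
        if (s > 0) ranked.push_back(std::make_pair(s, i));
    }
    // Highest score first; ties keep document order.
    std::stable_sort(ranked.begin(), ranked.end(),
        [](const std::pair<int, std::size_t>& a,
           const std::pair<int, std::size_t>& b) { return a.first > b.first; });
    if (ranked.size() > max_sentences) ranked.resize(max_sentences);
    // Restore document order for the selected sentences.
    std::sort(ranked.begin(), ranked.end(),
        [](const std::pair<int, std::size_t>& a,
           const std::pair<int, std::size_t>& b) { return a.second < b.second; });

    std::vector<std::string> result;
    for (const auto& r : ranked) result.push_back(sentences[r.second]);
    return result;
}
```

Calling summarise(text, {"xapian", "stemming"}, 2) would return the two sentences that mention those terms most often. A real summarizer would also want stop-word removal and some weighting by term rarity, which this sketch leaves out.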

Did you have a chance to take a look at the Flesch-Kincaid readability algorithm, which is designed to measure how difficult an English text is to comprehend?
http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test

Kevin Duraj
http://myhealthcare.com
