[Xapian-discuss] Document snippet generation

Wed Mar 19 20:54:22 GMT 2008

Colin,

Your code does not compile on Linux; I think it was written on Windows,
and I do not have much time to fix it. Even so, here is another great
algorithm: the Gunning fog index.
http://en.wikipedia.org/wiki/Gunning_fog_index

The Gunning fog index is designed to measure the readability of English
writing. The resulting number indicates how many years of formal
education a person needs in order to easily understand the text on a
first reading. With the Gunning fog index we could potentially estimate
the reading level of a web page, assign a boost value to it and get some
great page ranking like Google does. :-)
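
A minimal sketch of how the index could be computed, assuming the usual
formula 0.4 * (words / sentences + 100 * complex_words / words), where a
complex word has three or more syllables. The function names and the
vowel-group syllable counter are my own assumptions for illustration;
the full definition also excludes proper nouns, familiar jargon and
common suffixes, which this sketch ignores.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>

// Rough syllable estimate: count groups of consecutive vowels.
// Heuristic assumption only, not a dictionary-accurate count.
std::size_t estimate_syllables(const std::string& word) {
    const std::string vowels = "aeiouy";
    std::size_t count = 0;
    bool in_vowel_group = false;
    for (unsigned char ch : word) {
        const char c = static_cast<char>(std::tolower(ch));
        const bool vowel = vowels.find(c) != std::string::npos;
        if (vowel && !in_vowel_group) ++count;
        in_vowel_group = vowel;
    }
    return count;
}

// The formula itself, applied to counts gathered elsewhere.
double fog_from_counts(std::size_t words, std::size_t sentences,
                       std::size_t complex_words) {
    assert(complex_words <= words);
    if (words == 0 || sentences == 0) return 0.0;
    const double avg_sentence_len = static_cast<double>(words) / sentences;
    const double pct_complex = 100.0 * complex_words / words;
    return 0.4 * (avg_sentence_len + pct_complex);
}

// Naive counting: whitespace-separated tokens are words, and a token
// ending in '.', '!' or '?' closes a sentence.
double gunning_fog(const std::string& text) {
    std::istringstream in(text);
    std::string token;
    std::size_t words = 0, sentences = 0, complex_words = 0;
    while (in >> token) {
        bool has_alpha = false;
        for (unsigned char ch : token) {
            if (std::isalpha(ch)) { has_alpha = true; break; }
        }
        if (!has_alpha) continue;  // skip bare punctuation or numbers
        ++words;
        if (estimate_syllables(token) >= 3) ++complex_words;
        const char last = token.back();
        if (last == '.' || last == '!' || last == '?') ++sentences;
    }
    if (words > 0 && sentences == 0) sentences = 1;  // unterminated text
    return fog_from_counts(words, sentences, complex_words);
}
```

A score computed this way could be stored per document at index time and
used as one input to a boost; how to weight it is left open here.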

Kevin Duraj
http://myhealthcare.com

On Wed, Mar 19, 2008 at 12:37 PM, Colin Bell <colinabell at gmail.com> wrote:
>
> Hi Kevin
>
> I did attach the source code to the original posting but it seems not to
> have made it through the mailing list. You can download it from the link
> below. I am using it on our company search and it's doing a good job and is
> pretty fast. It needs a bit of tidying up and my C++ knowledge is very weak,
> so I could do with some help.
>
> I will do some reading on the link you sent, thanks.
>
> http://www.cbell.info/XapSum.zip
> Regards
> Colin
>
>
> On 19 Mar 2008, at 18:29, Kevin Duraj wrote:
> On Tue, Mar 18, 2008 at 7:15 AM, Colin Bell <colinabell at gmail.com> wrote:
> Hi All
>
> Following on from a discussion that was flying around a while back
> about document snippets (summaries), I have knocked together some
> proof of concept code (C++) that uses the Xapian stemming ability and
> sentence extraction (see http://en.wikipedia.org/wiki/Sentence_extraction).
> I also used the Open Text Summarizer project as an inspiration.
>
> It works quite well, but has some caveats which are explained in the
> code comments. It can summarise, highlight sentences and highlight
> words. It also has the ability to do context summaries. For example,
> if you supply it with terms, it will summarise the text within the
> context of those terms.
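
The idea described above can be sketched roughly as follows. This is not
the code from XapSum.zip; it is a minimal illustration of frequency-based
sentence extraction, and every name in it (Normalizer, split_sentences,
terms_of, summarise) is hypothetical. The stemmer is passed in as a
callback so that a Xapian::Stem object could be wrapped there; the
scoring rule (average document-wide frequency of a sentence's terms,
restricted to the context terms when any are given) is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical hook for the stemmer; a Xapian::Stem could be wrapped here.
using Normalizer = std::function<std::string(const std::string&)>;

// Naive splitter: a sentence ends at '.', '!' or '?'.
std::vector<std::string> split_sentences(const std::string& text) {
    std::vector<std::string> out;
    std::string cur;
    auto flush = [&]() {
        const std::size_t start = cur.find_first_not_of(" \t\r\n");
        if (start != std::string::npos) out.push_back(cur.substr(start));
        cur.clear();
    };
    for (char c : text) {
        cur += c;
        if (c == '.' || c == '!' || c == '?') flush();
    }
    flush();
    return out;
}

// Lower-cased alphabetic words, minus stopwords, passed through `normalize`.
// Stopwords are matched against the lower-cased, unstemmed word.
std::vector<std::string> terms_of(const std::string& s,
                                  const std::set<std::string>& stopwords,
                                  const Normalizer& normalize) {
    std::vector<std::string> out;
    std::string word;
    for (std::size_t i = 0; i <= s.size(); ++i) {
        const unsigned char ch = i < s.size() ? s[i] : ' ';
        if (std::isalpha(ch)) {
            word += static_cast<char>(std::tolower(ch));
        } else if (!word.empty()) {
            if (!stopwords.count(word)) out.push_back(normalize(word));
            word.clear();
        }
    }
    return out;
}

// Return up to `max_sentences` of the best-scoring sentences, in document order.
std::vector<std::string> summarise(const std::string& text, std::size_t max_sentences,
                                   const std::set<std::string>& stopwords,
                                   const std::vector<std::string>& context_terms,
                                   const Normalizer& normalize) {
    const std::vector<std::string> sentences = split_sentences(text);
    std::set<std::string> context;
    for (const std::string& ct : context_terms)
        for (const std::string& t : terms_of(ct, stopwords, normalize)) context.insert(t);

    std::vector<std::vector<std::string>> terms;
    std::map<std::string, std::size_t> freq;  // document-wide term frequency
    for (const std::string& s : sentences) {
        terms.push_back(terms_of(s, stopwords, normalize));
        for (const std::string& t : terms.back()) ++freq[t];
    }

    std::vector<std::pair<double, std::size_t>> scored;  // (score, sentence index)
    for (std::size_t i = 0; i < sentences.size(); ++i) {
        double sum = 0.0;
        for (const std::string& t : terms[i])
            if (context.empty() || context.count(t)) sum += freq[t];
        if (sum > 0.0) {
            assert(!terms[i].empty());
            scored.emplace_back(sum / terms[i].size(), i);
        }
    }
    std::stable_sort(scored.begin(), scored.end(),
                     [](const std::pair<double, std::size_t>& a,
                        const std::pair<double, std::size_t>& b) { return a.first > b.first; });
    if (scored.size() > max_sentences) scored.resize(max_sentences);

    std::vector<std::size_t> keep;
    for (const std::pair<double, std::size_t>& p : scored) keep.push_back(p.second);
    std::sort(keep.begin(), keep.end());
    std::vector<std::string> out;
    for (std::size_t i : keep) out.push_back(sentences[i]);
    return out;
}
```

Passing an empty context_terms list gives a general summary; passing
terms keeps only sentences that mention them.
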
>
> I am new to C++ programming so while you're laughing out loud at the
> poor coding, please keep that in mind. The code was assembled on
> Ubuntu Linux and comes with a Makefile. I have also supplied my
> stopper class. For some reason the stopper still fails to stop some of
> the words in the stopper (like "the"). If anyone knows why, please let
> me know.
>
> Feedback / comments / changes / improvements are more than welcome -
> bring it on. I really hope this sparks an interest.
>
> Regards
>
> Colin
>
>
> Colin!
>
> Great job, it definitely sparks an interest. Can you share the code with us,
> or send the link where we can download it? I will run the sentence
> summarizer against the myhealthcare.com search engine (73 million
> documents), and we will see what kind of results we get at the top.
> Hopefully, we will get rid of web sites using excessive keyword stuffing
> and spamdexing techniques.
>
> Did you have a chance to take a look at the Flesch-Kincaid readability
> algorithm, designed to measure comprehension difficulty in the English language?
> http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test
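
For reference, a small sketch of the two Flesch-Kincaid formulas as they
are usually stated: reading ease = 206.835 - 1.015 * (words / sentences)
- 84.6 * (syllables / words), and grade level = 0.39 * (words /
sentences) + 11.8 * (syllables / words) - 15.59. The TextCounts struct
and function names are assumptions for illustration; counting sentences,
words and syllables is left to the caller.

```cpp
#include <cassert>
#include <cstddef>

// Counts gathered from a text by some tokenizer (not shown here).
struct TextCounts {
    std::size_t sentences;
    std::size_t words;
    std::size_t syllables;
};

// Higher score means easier text.
double flesch_reading_ease(const TextCounts& c) {
    if (c.sentences == 0 || c.words == 0) return 0.0;
    assert(c.syllables >= c.words);  // every word has at least one syllable
    const double words_per_sentence = static_cast<double>(c.words) / c.sentences;
    const double syllables_per_word = static_cast<double>(c.syllables) / c.words;
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word;
}

// Approximate US school grade needed to follow the text.
double flesch_kincaid_grade(const TextCounts& c) {
    if (c.sentences == 0 || c.words == 0) return 0.0;
    assert(c.syllables >= c.words);
    const double words_per_sentence = static_cast<double>(c.words) / c.sentences;
    const double syllables_per_word = static_cast<double>(c.syllables) / c.words;
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59;
}
```
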
>
> Kevin Duraj
> http://myhealthcare.com
>
>