[Xapian-discuss] Document snippet generation

Colin Bell colinabell at gmail.com
Wed Mar 19 21:05:46 GMT 2008


Hi Kevin

Sorry to hear you're having a problem. My compiler info is

g++ (GCC) 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)

As you can see, it was developed and compiled on Ubuntu Linux. If you
send me the errors, I'll have a go at debugging it for you.

Regards

Colin

On 19 Mar 2008, at 20:54, Kevin Duraj wrote:

> Colin,
>
> Your code does not compile on Linux. I think it was written on Windows,
> and I do not have much time to fix it. Even so, here is another great
> algorithm: the Gunning fog index.
> http://en.wikipedia.org/wiki/Gunning_fog_index
>
> The Gunning fog index is designed to measure the readability of English
> writing. The resulting number indicates how many years of formal
> education a person needs in order to easily understand the text on the
> first reading. With the Gunning fog index we could potentially measure
> the intelligence of a web page, assign a boost value to it and get some
> great page ranking like Google does. :-)
>
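For reference, the index described above is usually given as
0.4 * (words / sentences + 100 * complex_words / words), where a complex
word has three or more syllables. Below is a minimal C++ sketch of that
formula. The function names estimate_syllables and gunning_fog are made up
for illustration, the syllable count is a crude vowel-run heuristic, and the
sketch ignores the usual exclusions (proper nouns, common suffixes, compound
words), so treat the numbers as rough.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Heuristic (an assumption, not a dictionary lookup): one syllable per run
// of consecutive vowels, with a minimum of one syllable per word.
int estimate_syllables(const std::string& word) {
    static const std::string vowels = "aeiouy";
    int runs = 0;
    bool in_run = false;
    for (unsigned char c : word) {
        const bool is_vowel =
            vowels.find(static_cast<char>(std::tolower(c))) != std::string::npos;
        if (is_vowel && !in_run) ++runs;
        in_run = is_vowel;
    }
    return runs > 0 ? runs : 1;
}

// Gunning fog index: 0.4 * (words/sentences + 100 * complex_words/words).
// A sentence is assumed to end at a token whose last character is . ! or ?
double gunning_fog(const std::string& text) {
    std::istringstream in(text);
    std::string token;
    int words = 0, sentences = 0, complex_words = 0;
    while (in >> token) {
        const char last = token.back();
        std::string word;  // the token with punctuation and digits removed
        for (unsigned char c : token) {
            if (std::isalpha(c)) word += static_cast<char>(c);
        }
        if (word.empty()) continue;
        ++words;
        if (estimate_syllables(word) >= 3) ++complex_words;
        if (last == '.' || last == '!' || last == '?') ++sentences;
    }
    if (words == 0) return 0.0;
    if (sentences == 0) sentences = 1;  // unterminated text counts as one sentence
    return 0.4 * (static_cast<double>(words) / sentences +
                  100.0 * complex_words / words);
}
```

With this heuristic, gunning_fog("The cat sat. The dog ran.") gives 1.2:
six words, two sentences and no complex words.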
> Kevin Duraj
> http://myhealthcare.com
>
>
> On Wed, Mar 19, 2008 at 12:37 PM, Colin Bell <colinabell at gmail.com> wrote:
>>
>> Hi Kevin
>>
>> I did attach the source code to the original posting, but it seems not to
>> have made it through the mailing list. You can download it here. I am
>> using it on our company search and it's doing a good job and is pretty
>> fast. It needs a bit of tidying up, and my C++ knowledge is very weak, so
>> I could do with some help.
>>
>> I will do some reading on the link you sent, thanks.
>>
>> http://www.cbell.info/XapSum.zip
>> Regards
>> Colin
>>
>>
>> On 19 Mar 2008, at 18:29, Kevin Duraj wrote:
>> On Tue, Mar 18, 2008 at 7:15 AM, Colin Bell <colinabell at gmail.com> wrote:
>> Hi All
>>
>> Following on from the discussion that was flying around a while back
>> about document snippets (summaries), I have knocked together some
>> proof of concept code (C++) that uses the Xapian stemming ability and
>> sentence extraction (see http://en.wikipedia.org/wiki/Sentence_extraction).
>> I also used the Open Text Summarizer project as an inspiration.
>>
>> It works quite well, but has some caveats which are explained in the
>> code comments. It can summarise, highlight sentences and highlight
>> words. It also has the ability to do context summaries. For example,
>> if you supply it with terms, it will summarise the text within the
>> context of those terms.
>>
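To illustrate the general idea of sentence extraction with context terms,
here is a minimal C++ sketch. It is not the code from XapSum.zip: the
function names (split_words, split_sentences, summarise), the scoring rule
(sum of document-wide word frequencies) and the context_boost parameter are
all assumptions made for this example. Stemming is left out so the sketch
has no dependencies; a stemmer such as Xapian::Stem could be applied to each
word in split_words.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <map>
#include <numeric>
#include <set>
#include <string>
#include <vector>

// Lowercase a sentence and split it into alphabetic words.
// (A stemmer could be applied to each word here; omitted in this sketch.)
std::vector<std::string> split_words(const std::string& sentence) {
    std::vector<std::string> words;
    std::string current;
    for (unsigned char c : sentence) {
        if (std::isalpha(c)) {
            current += static_cast<char>(std::tolower(c));
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}

// Naive sentence splitter: breaks at '.', '!' or '?' (abbreviations fool it).
std::vector<std::string> split_sentences(const std::string& text) {
    std::vector<std::string> sentences;
    std::string current;
    for (char c : text) {
        if (current.empty() && std::isspace(static_cast<unsigned char>(c))) continue;
        current += c;
        if (c == '.' || c == '!' || c == '?') {
            sentences.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) sentences.push_back(current);
    return sentences;
}

// Return up to max_sentences of the highest-scoring sentences, in document
// order. stop_words and context_terms are expected in lowercase.
std::vector<std::string> summarise(const std::string& text,
                                   std::size_t max_sentences,
                                   const std::set<std::string>& stop_words,
                                   const std::set<std::string>& context_terms = {},
                                   double context_boost = 10.0) {
    const std::vector<std::string> sentences = split_sentences(text);

    // Pass 1: document-wide frequency of every non-stop word.
    std::map<std::string, int> frequency;
    std::vector<std::vector<std::string>> kept_words;
    for (const std::string& s : sentences) {
        std::vector<std::string> kept;
        for (const std::string& w : split_words(s)) {
            if (stop_words.count(w) == 0) {
                kept.push_back(w);
                ++frequency[w];
            }
        }
        kept_words.push_back(kept);
    }

    // Pass 2: score = sum of word frequencies, plus a boost per context term.
    std::vector<double> score(sentences.size(), 0.0);
    for (std::size_t i = 0; i < sentences.size(); ++i) {
        for (const std::string& w : kept_words[i]) {
            score[i] += frequency[w];
            if (context_terms.count(w) != 0) score[i] += context_boost;
        }
    }

    // Rank by score (ties keep document order), then restore document order.
    std::vector<std::size_t> order(sentences.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&score](std::size_t a, std::size_t b) { return score[a] > score[b]; });
    if (order.size() > max_sentences) order.resize(max_sentences);
    std::sort(order.begin(), order.end());

    std::vector<std::string> summary;
    for (std::size_t i : order) summary.push_back(sentences[i]);
    return summary;
}
```

Summing frequencies favours long sentences; dividing each score by the
sentence's word count is one possible adjustment. Note that split_words
lowercases before the stop-word lookup, so "The" and "the" are treated alike.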
>> I am new to C++ programming, so while you're laughing out loud at the
>> poor coding, please keep that in mind. The code was assembled on
>> Ubuntu Linux and comes with a Makefile. I have also supplied my
>> stopper class. For some reason the stopper still fails to stop some of
>> the words in the stop list (like "the"). If anyone knows why, please let
>> me know.
>>
>> Feedback / comments / changes / improvements are more than welcome -
>> bring it on. I really hope this sparks an interest.
>>
>> Regards
>>
>> Colin
>>
>>
>> Colin!
>>
>> Great job, it definitely sparks an interest. Can you share the code with
>> us, or send the link where we can download it? I will run the sentence
>> summarizer against the myhealthcare.com 73 million document search
>> engine, and we will see what kind of results we get at the top.
>> Hopefully, we will get rid of web sites using excessive keyword stuffing
>> and spamdexing techniques.
>>
>> Did you have a chance to take a look at the Flesch-Kincaid readability
>> algorithm, designed to measure comprehension difficulty in the English
>> language?
>> http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test
>>
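For anyone comparing the two readability measures, here is a small C++
sketch of the standard Flesch formulas. The TextCounts struct and the
function names are invented for this example, and the word, sentence and
syllable counts are assumed to be gathered elsewhere.

```cpp
// Hypothetical container for counts computed elsewhere.
struct TextCounts {
    double words;
    double sentences;
    double syllables;
};

// Flesch reading ease: higher scores mean easier text.
double flesch_reading_ease(const TextCounts& c) {
    if (c.words <= 0 || c.sentences <= 0) return 0.0;
    return 206.835 - 1.015 * (c.words / c.sentences)
                   - 84.6 * (c.syllables / c.words);
}

// Flesch-Kincaid grade level: approximate US school grade.
double flesch_kincaid_grade(const TextCounts& c) {
    if (c.words <= 0 || c.sentences <= 0) return 0.0;
    return 0.39 * (c.words / c.sentences)
         + 11.8 * (c.syllables / c.words) - 15.59;
}
```

For 100 words, 5 sentences and 150 syllables, these give a reading ease of
59.635 and a grade level of 9.91.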
>> Kevin Duraj
>> http://myhealthcare.com
>>
>>



