[Xapian-discuss] Document snippet generation
alex at brasetvik.com
Wed Mar 19 16:50:26 GMT 2008
On Mar 18, 2008, at 15:15 , Colin Bell wrote:
> It works quite well, but has some caveats which are explained in the
> code comments.
> Feedback / comments / changes / improvements are more than welcome -
> bring it on. I really hope this sparks an interest.
> //split the text into sentences using . ? ; | !
> //there is a gotcha here by the fact that it catches things like 3.5 %
> or £1.532 any ideas?
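Seems you've glimpsed the problem that is sentence extraction. For a nice
introduction on the topic, see "What is a word, what is a sentence?
Problems of tokenization" (linked below). The Natural Language Toolkit
has an implementation of the Punkt sentence tokenizer, described in
"Unsupervised Multilingual Sentence Boundary Detection" (also linked
below). It works quite well.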
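To illustrate the decimal-point gotcha from the quoted comment, here is a minimal sketch in Python. It is not the code from the original post and not the Punkt algorithm; it only assumes that a delimiter counts as a sentence boundary when whitespace follows it. The name split_sentences is made up for this example.

```python
import re

# Treat . ? ! ; | as a boundary only when whitespace follows it.
# The "." in "3.5" or "£1.532" is followed by a digit, so it is skipped.
_BOUNDARY = re.compile(r"(?<=[.?!;|])\s+")


def split_sentences(text):
    """Split text into sentences, keeping the closing delimiter on each one."""
    return [part for part in _BOUNDARY.split(text.strip()) if part]


if __name__ == "__main__":
    sample = "It costs £1.532 today. Growth was 3.5 % last year! Any ideas?"
    for sentence in split_sentences(sample):
        print(sentence)
```

This still splits wrongly after abbreviations such as "e.g." or "Dr.", which is the kind of case a trained tokenizer like Punkt is meant to handle. With NLTK and its Punkt data installed, nltk.tokenize.sent_tokenize(text) would be the call to try instead.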
Also, there are many aspects of rating the best snippets found, such as:
* How many unique query terms have matched in the sentence? Paragraph?
* How many matches are in nearby paragraphs and in the same context?
* How long is the snippet compared to other snippets containing the
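A rough sketch of how criteria like these could be combined into one score, in Python. Everything here is an assumption made for illustration: the function names, the weights, the use of sentences rather than paragraphs as the unit, and the length handling, which is only a guess at what a length criterion might look like.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a stand-in for whatever term normalisation the index uses."""
    return re.findall(r"\w+", text.lower())


def score_sentences(sentences, query_terms, neighbour_weight=0.5,
                    max_words=30, length_penalty=0.1):
    """Score each sentence by unique query terms matched, matches in the
    adjacent sentences, and a penalty for words beyond max_words."""
    terms = {t.lower() for t in query_terms}
    tokens = [tokenize(s) for s in sentences]
    # Number of distinct query terms found in each sentence.
    hits = [len(terms & set(toks)) for toks in tokens]
    scores = []
    for i, toks in enumerate(tokens):
        # Matches in the previous and next sentence count as context.
        neighbours = hits[max(0, i - 1):i] + hits[i + 1:i + 2]
        excess = max(0, len(toks) - max_words)
        scores.append(hits[i]
                      + neighbour_weight * sum(neighbours)
                      - length_penalty * excess)
    return scores


def best_snippet(sentences, query_terms):
    """Return the highest-scoring sentence (the first one on ties)."""
    scores = score_sentences(sentences, query_terms)
    return sentences[max(range(len(sentences)), key=scores.__getitem__)]
```

The weights would need tuning against real queries; the point is only that each heuristic becomes one term in the sum, so new criteria can be added without restructuring.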
I've only done a proof-of-concept implementation of a snippet generator
where NLTK did all the heavy work, so I'm sorry I don't have any code of
interest for you. However, I might at least save you some research if
you find these articles interesting as well:
* Query-biased summarization based on lexical chaining 
* Learning query-biased web page summarization 
* Fast generation of result snippets in web search 
 "What is a word, what is a sentence? Problems of tokenization", http://scholar.google.no/scholar?hl=no&lr=&cluster=2729755208272104985
 "Unsupervised Multilingual Sentence Boundary Detection", http://scholar.google.no/scholar?hl=no&lr=&cluster=10262315429933335261