[Xapian-discuss] Document snippet generation

Wed Mar 19 16:50:26 GMT 2008

On Mar 18, 2008, at 15:15 , Colin Bell wrote:

> It works quite well, but has some caveats which are explained in the  
> code comments.

(...)

> Feedback / comments / changes / improvements are more than welcome -  
> bring it on. I really hope this sparks an interest.

> //split the text into sentences using . ? ; | !
> //there is a gotcha here by the fact that it catches things like 3.5 %
> //or £1.532 any ideas?

It seems you've run into the problem of sentence extraction. For a nice
introduction to the topic, see [1]. The Natural Language Toolkit [2] has
an implementation of the Punkt sentence tokenizer [3][4]. It works quite
well.
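
As a stopgap for the decimal-point gotcha, here is a minimal sketch that
only splits on a delimiter when whitespace follows it. This is not the
Punkt algorithm, just a regex heuristic; the name split_sentences is made
up for illustration, and it will still split wrongly after abbreviations
such as "e.g.".

```python
import re

# Split on whitespace that directly follows one of . ? ; | !
# A digit after the dot ("3.5", "£1.532") means no whitespace follows the
# delimiter, so those tokens stay in one piece.
# Assumption for illustration only: this is a heuristic, not Punkt.
_BOUNDARY = re.compile(r"(?<=[.?;|!])\s+")


def split_sentences(text):
    """Return the non-empty sentences found in text."""
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]


if __name__ == "__main__":
    sample = "It rose 3.5 % today. The price was £1.532! Any ideas?"
    for sentence in split_sentences(sample):
        print(sentence)
```

A trained tokenizer such as Punkt is still the better choice for real
text, since it learns abbreviations from the corpus instead of relying on
whitespace alone.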

Also, there are many aspects of rating the best snippets found, such as:

  * How many unique query terms have matched in the sentence? Paragraph?
    Snippet?
  * How many matches are in nearby paragraphs and in the same context?
  * How long is the snippet compared to other snippets containing the
    same terms?
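
A minimal sketch of how the first and last criteria above could be
combined into a score. It counts unique query terms matched and subtracts
a small penalty per word, so that shorter snippets win ties. The function
names and the length_penalty weight are assumptions made up for
illustration, and the nearby-paragraph criterion is not covered.

```python
import re


def tokens(text):
    """Lowercased word tokens; a deliberately naive tokenizer."""
    return re.findall(r"\w+", text.lower())


def score_snippet(snippet, query_terms, length_penalty=0.01):
    """Unique query terms matched, minus a small penalty per word.

    length_penalty is an arbitrary illustrative weight, not a tuned value.
    """
    words = tokens(snippet)
    matched = set(words) & {term.lower() for term in query_terms}
    return len(matched) - length_penalty * len(words)


def best_snippet(candidates, query_terms):
    """Return the highest-scoring candidate snippet."""
    return max(candidates, key=lambda s: score_snippet(s, query_terms))


if __name__ == "__main__":
    candidates = [
        "Xapian is a search library.",
        "Snippet generation in Xapian needs sentence splitting.",
        "Nothing relevant here.",
    ]
    print(best_snippet(candidates, ["xapian", "snippet"]))
```

Counting unique terms rather than raw occurrences keeps a sentence that
repeats one query term many times from outranking a sentence that covers
more of the query.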

I've only done a proof-of-concept implementation of a snippet
highlighter, where nltk did all the heavy lifting, so I'm sorry I don't
have any code of interest for you. However, I might at least save you
some research if you find these articles interesting as well:

  * Query-biased summarization based on lexical chaining [5]
  * Learning query-biased web page summarization [6]
  * Fast generation of result snippets in web search [7]

~

[1] "What is a word, what is a sentence? Problems of tokenization", http://scholar.google.no/scholar?hl=no&lr=&cluster=2729755208272104985
[2] http://nltk.org/
[3] http://nltk.org/doc/api/nltk.tokenize.punkt-pysrc.html
[4] "Unsupervised Multilingual Sentence Boundary Detection", http://scholar.google.no/scholar?hl=no&lr=&cluster=10262315429933335261
[5] http://scholar.google.no/scholar?hl=no&lr=&cluster=6956910332766837477
[6] http://portal.acm.org/citation.cfm?id=1321518
[7] http://scholar.google.no/scholar?hl=no&lr=&cluster=16293287331469923302

--
Alex Brasetvik