[Xapian-discuss] Document snippet generation
Alex Brasetvik
alex at brasetvik.com
Wed Mar 19 16:50:26 GMT 2008
On Mar 18, 2008, at 15:15 , Colin Bell wrote:
> It works quite well, but has some caveats which are explained in the
> code comments.
(...)
> Feedback / comments / changes / improvements are more than welcome -
> bring it on. I really hope this sparks an interest.
> //split the text into sentences using . ? ; | !
> //there is a gotcha here by the fact that it catches things like 3.5 %
> or £1.532 any ideas?
Seems you've run into the problem of sentence extraction. For a nice
introduction to the topic, see [1]. The Natural Language Toolkit [2] has
an implementation of the Punkt sentence tokenizer [3][4]. It works quite
well.
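
To illustrate the decimal gotcha from the quoted code comment, here is a
minimal sketch using only the Python standard library. It is not Punkt and
not what nltk does; split_sentences is a hypothetical helper name, and the
rule "only split when the delimiter is followed by whitespace" is my own
assumption about what would be good enough for a first pass.

```python
import re

# Treat . ? ! ; | as sentence delimiters, but only when whitespace follows.
# The period in "3.5" or "1.532" is followed by a digit, so it is skipped.
# Assumption: this is a naive heuristic, not the Punkt algorithm.
_BOUNDARY = re.compile(r"(?<=[.?!;|])\s+")


def split_sentences(text):
    """Split text into sentences without breaking on decimal points."""
    return [part for part in _BOUNDARY.split(text.strip()) if part]


if __name__ == "__main__":
    sample = "Prices rose 3.5 % last year. The fee is £1.532 per unit! Any ideas?"
    for sentence in split_sentences(sample):
        print(sentence)
```

This still splits wrongly after abbreviations such as "e.g. " or "Dr. ",
which is the kind of case the Punkt approach in [4] is designed to handle.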
Also, there are many aspects to rating the best snippets found, such as:
* How many unique query terms have matched in the sentence? In the
  paragraph? In the snippet?
* How many matches are in nearby paragraphs and in the same context?
* How long is the snippet compared to other snippets containing the
  same terms?
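
The criteria above could be combined roughly as in the sketch below. This
is only an illustration, not code from my proof of concept: rank_snippets,
the neighbour_weight parameter and its default of 0.5 are all made up for
the example, and adjacent sentences stand in for "nearby paragraphs".

```python
import re


def _terms(text):
    """Return the set of lowercased word tokens in text."""
    return set(re.findall(r"\w+", text.lower()))


def rank_snippets(sentences, query_terms, neighbour_weight=0.5):
    """Rank candidate sentences for use as snippets (illustrative only).

    Score = unique query terms matched in the sentence, plus a weighted
    count of matches in the adjacent sentences. Ties go to the shorter
    sentence. Sentences without any match are dropped.
    """
    query = {term.lower() for term in query_terms}
    hits = [len(_terms(sentence) & query) for sentence in sentences]

    ranked = []
    for i, sentence in enumerate(sentences):
        if hits[i] == 0:
            continue
        neighbours = hits[i - 1] if i > 0 else 0
        if i + 1 < len(sentences):
            neighbours += hits[i + 1]
        score = hits[i] + neighbour_weight * neighbours
        # Negative length so that shorter sentences sort first on ties.
        ranked.append((score, -len(sentence), sentence))

    ranked.sort(reverse=True)
    return [sentence for _, _, sentence in ranked]
```

Calling rank_snippets(sentences, ["snippet", "generation"]) returns the
matching sentences, best first. The weights would need tuning against
real queries.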
I've only done a proof-of-concept implementation of a snippet
highlighter, where nltk did all the heavy work, so I'm sorry I don't
have any code of interest for you. However, I might at least save you
some research if you find these articles interesting as well:
* Query-biased summarization based on lexical chaining [5]
* Learning query-biased web page summarization [6]
* Fast generation of result snippets in web search [7]
[1] "What is a word, what is a sentence? Problems of tokenization", http://scholar.google.no/scholar?hl=no&lr=&cluster=2729755208272104985
[2] http://nltk.org/
[3] http://nltk.org/doc/api/nltk.tokenize.punkt-pysrc.html
[4] "Unsupervised Multilingual Sentence Boundary Detection", http://scholar.google.no/scholar?hl=no&lr=&cluster=10262315429933335261
[5] http://scholar.google.no/scholar?hl=no&lr=&cluster=6956910332766837477
[6] http://portal.acm.org/citation.cfm?id=1321518
[7] http://scholar.google.no/scholar?hl=no&lr=&cluster=16293287331469923302
--
Alex Brasetvik