[Xapian-discuss] Text snippets

Do. do1 at yandex.ru
Thu Dec 17 08:29:52 GMT 2009


Are there any advancements in snippeting (besides what is mentioned in the wiki)? I think extracting snippets is clearly an IR task, and I hope Xapian will provide at least some helpers for it. I have a set of documents with up to 5M of extracted text each and 1M on average (the original PDFs are even bigger, but I pre-extract the text into a sort of text cache because pdftotext is very slow). Parsing ~1M of text on the fly for each of the 10 documents shown is probably too CPU/disk intensive (10M of disk I/O plus parsing just to show a single search results page does not seem like a job for Perl). But I can bear the sizes. The harder part is correctly locating snippets for a user-entered query. Maybe Xapian can provide such functionality, i.e. locate the best-matching snippets in the text for the query string? I think the absence of snippeting is currently a major weakness that needs to be solved.
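To make the request concrete, here is a rough sketch in Python of what "locate the best-matching snippet in the text for the query string" could mean. It is not Xapian code and uses no Xapian API; the function name `best_snippet` and the `width` parameter are made up for illustration. It assumes plain lowercase word matching (no stemming, no phrase handling) and scans the whole cached text once, so it does nothing about the disk I/O cost mentioned above.

```python
import re


def best_snippet(text, query, width=200):
    """Return a window of at most `width` characters from `text` that
    covers the most distinct query terms (ties broken by total hits).

    Hypothetical helper for illustration only: exact lowercase word
    matching, no stemming, not part of any library.
    """
    terms = {t.lower() for t in re.findall(r"\w+", query)}

    # Collect (start, end, word) for every word in the text that is a query term.
    hits = [
        (m.start(), m.end(), m.group().lower())
        for m in re.finditer(r"\w+", text)
        if m.group().lower() in terms
    ]
    if not hits:
        # Nothing matched: fall back to the beginning of the text.
        return text[:width]

    best_score = (0, 0)
    best_span = (hits[0][0], hits[0][1])
    left = 0
    for right in range(len(hits)):
        # Drop hits from the left until the remaining hits fit into `width`.
        while left < right and hits[right][1] - hits[left][0] > width:
            left += 1
        window = hits[left:right + 1]
        score = (len({word for _, _, word in window}), len(window))
        if score > best_score:
            best_score = score
            best_span = (hits[left][0], hits[right][1])

    # Centre the matched span inside the window to show context on both sides.
    first, last = best_span
    pad = max(0, (width - (last - first)) // 2)
    start = max(0, first - pad)
    return text[start:start + width]
```

Called as `best_snippet(cached_text, "user query", width=200)`, it returns one candidate passage per document. A real helper would presumably also need stemming consistent with the index and some way to avoid reading the full text for every result.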

