[Xapian-discuss] Text snippets

Do. do1 at yandex.ru
Thu Dec 24 19:22:36 GMT 2009


If anybody were to implement this for Xapian, what would be the best strategy?

Here is my guess:
1. The user provides the source text, the parsed query, a highlight prefix/suffix, and the number
 of snippets she needs.
2. The text is parsed again, splitting it into words (as index_text does), stemming, etc.
3. The parser must record each original word's start and end position and how it was parsed (and stemmed).
4. Match each parsed word against the query. (Easy for everything except phrases.)
There is another question: what is the best algorithm for choosing snippets?
5. For example, we split the source text into sentences, separated by full stops. If a sentence
 has matched words, it is a candidate for the list, weighted by the number of words matched.
If it is weightier than a sentence already in the list, add it; if not, add it only if there is room.
 The scan cannot be stopped early because a weightier sentence may always lie ahead.
6. Then highlight the collected sentences by adding the user-specified prefix/suffix at the remembered positions.
Wow, that became really complicated. And this is not a speed-optimal algorithm: it gets no help from
 the indexes, so, for example, reparsing 10 files of 1M each would really load the CPU.
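
To make the steps above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not an existing Xapian API: the names make_snippets, highlight and toy_stem are hypothetical, toy_stem is a crude stand-in for a real stemmer, the "parsed query" is reduced to a plain list of terms, sentences are split on full stops only, and phrase matching is not handled.

```python
import heapq
import re

WORD_RE = re.compile(r"\w+")
SENTENCE_RE = re.compile(r"[^.]+\.?")  # assumption: sentences end at a full stop


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: lowercase, strip a few suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def highlight(text, start, end, hits, prefix, suffix):
    """Step 6: wrap each remembered (start, end) word span in prefix/suffix."""
    parts, pos = [], start
    for hit_start, hit_end in hits:
        parts.append(text[pos:hit_start])
        parts.append(prefix + text[hit_start:hit_end] + suffix)
        pos = hit_end
    parts.append(text[pos:end])
    return "".join(parts).strip()


def make_snippets(text, query_terms, prefix="<b>", suffix="</b>", max_snippets=2):
    wanted = {toy_stem(term) for term in query_terms}
    best = []  # min-heap holding at most max_snippets candidate sentences
    for order, sent in enumerate(SENTENCE_RE.finditer(text)):
        hits, matched = [], set()
        # Steps 2-4: re-parse the words, keeping their offsets in the original text.
        for word in WORD_RE.finditer(text, sent.start(), sent.end()):
            stem = toy_stem(word.group())
            if stem in wanted:
                hits.append((word.start(), word.end()))
                matched.add(stem)
        if not matched:
            continue
        # Step 5: weight = number of distinct query terms matched.
        # -order makes the later sentence lose when weights are equal.
        heapq.heappush(best, (len(matched), -order, sent.start(), sent.end(), hits))
        if len(best) > max_snippets:
            heapq.heappop(best)  # drop the currently weakest candidate
    chosen = sorted(best, key=lambda item: item[2])  # back to document order
    return [highlight(text, s, e, hits, prefix, suffix) for _, _, s, e, hits in chosen]


text = ("Xapian is a search library. It indexes documents quickly. "
        "Snippets show matching words from indexed documents. Nothing here.")
for snippet in make_snippets(text, ["index", "documents"]):
    print(snippet)
```

The whole text is still scanned, as noted above, but the heap bounds memory to the requested number of snippets rather than keeping every matching sentence.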

Happy holidays.

24.12.09, 01:55, "Olly Betts" <olly at survex.com>:

> On Thu, Dec 17, 2009 at 11:29:52AM +0300, Do. wrote:
>  > Are there any advancements in snippeting? (Besides what is mentioned in the wiki.) I
>  > think extracting snippets is clearly an IR task. And I hope Xapian will provide
>  > at least helpers to do that.
>  
>  I agree that it is a feature which would fit well in Xapian, but nobody has
>  yet implemented it.  I don't know of anybody currently working on it (and
>  since nobody else has responded to your post, I guess nobody is).
>  
>  There's a ticket in trac as well as the FAQ entry.  The FAQ entry had some
>  rough edges (e.g. the sample thread it linked to wasn't about snippets at all)
>  so I've overhauled it, and linked to the ticket as part of that:
>  
>  http://trac.xapian.org/wiki/FAQ/Snippets
>  
>  Cheers,
>      Olly
>  
>  


