[Xapian-discuss] Text snippets

Peter Karman peter at peknet.com
Fri Jan 15 15:56:07 GMT 2010


Graham Kann wrote on 1/15/10 4:10 AM:
> Hello!
> 
> I'm new to the list, so please bear with me.
> 
> This is a follow-on to Shripad Bodas's question about snippets/highlighting.
> 
> I too have tried the Perl Search::Tools module, and it works well.  The only
> problem is that it's slow - e.g., when displaying more than (say) 50
> results, the time taken to render the page (snipping and highlighting with
> Search::Tools) can actually exceed the time to perform the search itself...
> 

That's similar to what I see in my own use. Of course, "slow" is relative 
because Xapian search time is so fast. I can still process 50 results in less 
than a second, but perhaps your requirements are more rigorous.

> Seeing that our search frontend is coded in PHP, it makes sense to use PHP
> exclusively (calling a Perl routine works, but you pay a double penalty -
> the one mentioned above, plus the usual costs associated with calling
> Perl).  Is anyone aware of PHP code I can use to create excerpts/snippets
> and keyword highlighting (with usage of stemming of course)?
> 

If you do manage to get something working in PHP, I would be interested, since I 
too use PHP at $work.

IME, the problem is not the language but the algorithm: trade-offs between 
speed, flexibility and accuracy. Search::Tools targets accuracy first, 
flexibility second, and then tries to mitigate the speed issue by doing all the 
heavy lifting in C/XS. The pure Perl implementation is much slower, and I would 
imagine the same algorithm in PHP would not be any faster. S::T uses regex to do 
all the term matching, and while regex is fast, the process could be made much 
faster if (for example) the positional offsets of matching terms were known in 
advance (e.g. stored in the index) or by optimizing (e.g. caching) for 
particular applications (like Xapian). S::T basically re-parses the original 
text at search time, and since the price of doing that has already been paid at 
indexing time, it would be optimal to save the relevant information somehow. But 
S::T is agnostic wrt search library (I use it with Xapian, Swish-e, KinoSearch, 
etc) and so those optimizations haven't been a priority. KinoSearch (e.g.) does 
store the positional offsets and has snippet/highlighting code built-in. So does 
Lucene IIRC.
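
To illustrate the point about positional offsets, here is a rough sketch (in 
Python, only for brevity) of building a snippet from offsets recorded at 
indexing time, so the text is not re-parsed with a regex at search time. This is 
not how Search::Tools, Xapian or KinoSearch actually work: the function names 
(index_offsets, snippet), the offset layout and the <b> tags are all 
hypothetical, and stemming is left out (terms are only lowercased).

```python
import re

# Hypothetical sketch: none of these names come from Search::Tools,
# Xapian or KinoSearch. Stemming is omitted; terms are only lowercased.


def index_offsets(text):
    """Indexing-time step: map each lowercased term to its (start, end) offsets."""
    offsets = {}
    for m in re.finditer(r"\w+", text):
        offsets.setdefault(m.group().lower(), []).append((m.start(), m.end()))
    return offsets


def snippet(text, offsets, terms, context=30, open_tag="<b>", close_tag="</b>"):
    """Search-time step: build a highlighted excerpt using stored offsets only."""
    # Collect every stored span for the query terms, in document order.
    hits = sorted({span for t in terms for span in offsets.get(t.lower(), [])})
    if not hits:
        # No match: fall back to the start of the document.
        return text[: 2 * context]

    # Window of `context` characters on either side of the first hit.
    lo = max(0, hits[0][0] - context)
    hi = min(len(text), hits[0][1] + context)

    out = []
    pos = lo
    for start, end in hits:
        if start < lo or end > hi:
            continue  # hit lies outside the excerpt window
        out.append(text[pos:start])
        out.append(open_tag + text[start:end] + close_tag)
        pos = end
    out.append(text[pos:hi])
    return "".join(out)
```

The regex runs once per document in index_offsets; snippet only slices the 
stored text, which is the kind of saving I mean. In a real setup the offsets 
would have to be stored alongside the document at indexing time, and how to do 
that depends on the search library.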

-- 
Peter Karman  .  http://peknet.com/  .  peter at peknet.com


