[Xapian-discuss] custom

gervin23 gervin23 at fastmail.fm
Tue Dec 21 20:11:47 GMT 2004


i was playing with various ideas around hit-highlighting and have a 
version in python that parses the documents (html in this case) and 
uses a regex to surround the matched terms with a <span> tag. it 
works for the moment and is quite speedy, but i can't imagine the server 
handling it nicely under load. my next idea isn't possible with 
xapian at the moment (i don't think?), so i was wondering if there's a 
workaround, patch, or better idea. basically, what i'd like is to add a 
4th parameter to add_posting(term, pos, wdf) where i could store 
something like the byte offset of the term (as opposed to the position 
list value).

for example, for html like the following, the offsets would record where 
the term(s) physically sit in the document, accounting for tags, 
stopwords ('by' in this case), etc:
<html><title>Search by Document</title></html>

where my indexing calls would look something like:
add_posting('search',1,1,13) <- position=1,byteoffset=13
add_posting('document',2,1,23) <- position=2,byteoffset=23
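
a rough python sketch of how those (position, byteoffset) pairs could be 
worked out at index time. the stopword list, the regex and the 
term_offsets name are assumptions made up for illustration, not anything 
xapian provides, and offsets are counted from 0:

```python
import re

# assumption: a tiny illustrative stopword list
STOPWORDS = {"by"}

# matches either a whole tag or a run of ascii letters/digits
TAG_OR_WORD = re.compile(rb"<[^>]*>|[A-Za-z0-9]+")

def term_offsets(html_bytes, stopwords=STOPWORDS):
    """Return (term, position, byte_offset) for each indexable word."""
    postings = []
    position = 0
    for match in TAG_OR_WORD.finditer(html_bytes):
        token = match.group()
        if token.startswith(b"<"):
            continue  # tags add to the byte offset but are not terms
        term = token.decode("ascii").lower()
        if term in stopwords:
            continue  # stopwords are skipped and get no position
        position += 1
        postings.append((term, position, match.start()))
    return postings

postings = term_offsets(b"<html><title>Search by Document</title></html>")
```

for the html above this gives 13 for 'search' and 23 for 'document', 
matching the calls i wrote out by hand.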

note, it might also serve better to use octal or hex for this purpose. 
any ideas greatly appreciated.
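
to show why i want the offsets: if they came back with the hits, 
highlighting would just be slicing the stored html instead of re-parsing 
it with a regex. a minimal sketch, assuming the (byteoffset, length) 
pairs are already available from somewhere (the highlight function and 
the span class name are hypothetical):

```python
def highlight(html_bytes, hits,
              open_tag=b'<span class="hit">', close_tag=b"</span>"):
    """Wrap each (byte_offset, length) hit in a span, without parsing."""
    out = []
    cursor = 0
    for offset, length in sorted(hits):
        out.append(html_bytes[cursor:offset])           # untouched text
        out.append(open_tag)
        out.append(html_bytes[offset:offset + length])  # the matched term
        out.append(close_tag)
        cursor = offset + length
    out.append(html_bytes[cursor:])                     # remainder
    return b"".join(out)

page = b"<html><title>Search by Document</title></html>"
marked = highlight(page, [(13, 6), (23, 8)])
```

this assumes the hits don't overlap and that the stored html is byte for 
byte the same as what was indexed.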

one other question i have has to do with 2-term phrase searches. i find 
these particular searches orders of magnitude slower than 3+ term phrase 
searches (sometimes 80 seconds). it seems the more terms i add, the faster 
the results. for these tests, i performed all-new searches each time (trying 
to work around the cache system) and found the results pretty consistent. 
any ideas as to why this might be happening? also, i dug a little into 
how the storage system was behaving while doing these searches and found 
xapian using less than 1% of available throughput (cpu pretty much 
idle). now, this is on a regular desktop system so if that's the case, 
how would a RAID'd system help? i'm most likely missing something here 
so a little more insight would help tremendously.

thanks much,
