[Xapian-discuss] custom
gervin23
gervin23 at fastmail.fm
Tue Dec 21 20:11:47 GMT 2004
hello,
i was playing with various ideas around hit-highlighting and have a
version in python that parses the documents (html in this case) and
uses a regex to surround the matched terms with a <span> tag. it
works for the moment and is quite speedy, but i can't imagine the server
handling it nicely under load. my next idea isn't possible with
xapian at the moment (i don't think), so i was wondering if there's a
workaround, patch, or better idea. basically, what i'd like is to add a
4th parameter to add_posting(term, pos, wdf) where i could store
something like the byte offset of the term (as opposed to its position
list value).
for example, with html like the following, the byte offset would record
where each term physically sits in the document, counting the bytes taken
up by tags, stopwords ('by' in this case), etc:
<html><title>Search by Document</title></html>
where my indexing calls would look something like:
add_posting('search',1,1,13) <- position=1,byteoffset=13
add_posting('document',2,1,23) <- position=2,byteoffset=23
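to make the numbers above concrete, here is a rough sketch of how the (term, position, byteoffset) triples could be computed in plain python. it does not touch xapian at all; the function name term_offsets, the toy stopword list and the simple tag regex are my own assumptions for illustration, and a real indexer would need a proper html parser and handling of entities and non-ascii text.

```python
import re

# Assumption: a toy stopword list, just enough for the example document.
STOPWORDS = {"by"}

# Match either a whole tag (skipped) or a run of ASCII letters/digits (a word).
# Working on bytes means m.start() is a byte offset, not a character offset.
TAG_OR_WORD = re.compile(rb"<[^>]*>|[A-Za-z0-9]+")


def term_offsets(html_bytes, stopwords=STOPWORDS):
    """Return a list of (term, position, byte_offset) for indexable words."""
    postings = []
    position = 0
    for m in TAG_OR_WORD.finditer(html_bytes):
        token = m.group(0)
        if token.startswith(b"<"):
            continue  # a tag: it occupies bytes but is not a term
        term = token.decode("ascii").lower()
        if term in stopwords:
            continue  # a stopword: occupies bytes, gets no position
        position += 1
        postings.append((term, position, m.start()))
    return postings


doc = b"<html><title>Search by Document</title></html>"
for term, pos, offset in term_offsets(doc):
    print(term, pos, offset)
```

for the sample document this yields position 1 / offset 13 for 'search' and position 2 / offset 23 for 'document', matching the calls above. since add_posting() has no slot for the offset, the triples would have to be kept somewhere else for now (for instance serialized alongside the document), which is exactly the part i'm unsure about.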
note, it might also serve better to use octal or hex for this purpose.
any ideas greatly appreciated.
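to show why the offsets would help, here is a minimal sketch of highlighting driven by stored byte offsets instead of a regex pass over the whole document at query time. the highlight function, the (offset, length) hit format and the span class name are hypothetical; it also assumes the length of the original word is known, which would not hold if only a stemmed term were stored.

```python
def highlight(html_bytes, hits,
              open_tag=b'<span class="hit">', close_tag=b"</span>"):
    """Wrap each (byte_offset, byte_length) hit in a span tag."""
    out = bytearray(html_bytes)
    # Insert from the end of the document backwards so that the offsets
    # of hits not yet processed are still valid.
    for offset, length in sorted(hits, reverse=True):
        out[offset + length:offset + length] = close_tag
        out[offset:offset] = open_tag
    return bytes(out)


doc = b"<html><title>Search by Document</title></html>"
hits = [(13, 6), (23, 8)]  # 'Search' and 'Document' from the example
print(highlight(doc, hits).decode("ascii"))
```

the work per document is then proportional to the number of hits rather than to the document size, which is the load saving i'm after.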
one other question i have has to do with 2-term phrase searches. i find
these searches take orders of magnitude longer than 3+ term phrase searches
(sometimes 80 seconds). it seems the more terms i add, the faster the
results. for these tests, i ran entirely new searches each time (trying
to work around the cache system) and found the results pretty consistent.
any ideas as to why this might be happening? also, i dug a little into
how the storage system was behaving while doing these searches and found
xapian using less than 1% of available throughput (cpu pretty much
idle). now, this is on a regular desktop system so if that's the case,
how would a RAID'd system help? i'm most likely missing something here
so a little more insight would help tremendously.
thanks much,
andrew