[Xapian-discuss] search result context.

Wed Jan 25 21:14:44 GMT 2006

I've always wanted to tackle the search result context problem, and a 
current project finally gave me a reason to: I'm implementing a mail 
search utility. 

I should start at the beginning.  I was laid off last year but had a 
number of emails that I wanted to keep, not really work related.  I 
copied the files to a CD before I left.  And I've just moved from 
Windows to Linux desktop and found that Thunderbird didn't really want 
to use my old mail files so I had to find a way to sort them out.  Enter 
Xapian.

To spare you most of the details, I have a Perl CGI script that presents 
me with two search boxes, title and body.  I've indexed the author too, 
but haven't implemented searching on it yet.  I can also sort on author, 
date (and reverse date), relevance, title and collection.  Collection is 
which of the three databases I'm interested in: Linux, Windows, or Work.  
I also have an AND checkbox. 

I used scriptindex to generate the search databases and am using the XML 
template, somewhat modified, to get the results.  I form a call to 
localhost/cgi-bin/omega in the Perl CGI script and use XML::Simple to 
parse the results.  Once I have the terms I take each of the files that 
the search indicated and find the lines in that file that contain at 
least one of the terms.  I save that line in addition to the next line 
so I'm sure I have a few words on both sides.  When I have a set of 
lines, I break them into words at whitespace and start searching one 
word at a time for a match.  Then I take the previous 4 words and the 
next 4 words and make a phrase from them.  I don't reuse any words.  If 
they were in the previous phrase, they won't be in this one.  Probably 
not ideal, but it works.  The biggest problem is stemming.  If I search for 
the word wanted, I get wanting, want, wants, etc.  I could toss that 
result away if it didn't really match, but I've elected to highlight the 
stemmed word and go on in that case.  Anyway, it took me maybe 4 hours to 
complete the task, and while it isn't Google-perfect, it seems to work 
for me.  I'm running it on an AMD 2200, 776108 kB memory, big IDE disk.  
Nothing special, and it is not noticeably slower than before I inserted 
the context generation.  I don't have any large files to deal with, 
though.  Mail messages don't usually run into megabytes.
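
The phrase-building step described above can be sketched roughly as 
follows.  This is Python rather than the Perl of the actual script, and 
it is only an illustration: the function names, the crude 
suffix-stripping that stands in for the real stemmer, and the square 
brackets used for highlighting are all assumptions, not what the script 
really does.

```python
import re

# Crude stand-in for a real stemmer (assumption, not Xapian's algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Lowercase, strip punctuation, and drop one common suffix."""
    word = re.sub(r"[^a-z0-9]", "", word.lower())
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def context_phrases(lines, terms, width=4):
    """Return phrases of up to `width` words either side of each match."""
    stems = {crude_stem(t) for t in terms}

    # Keep every line containing a term, plus the line that follows it.
    keep = set()
    for i, line in enumerate(lines):
        if any(crude_stem(w) in stems for w in line.split()):
            keep.update({i, i + 1})
    words = [w for i in sorted(keep) if i < len(lines)
             for w in lines[i].split()]

    phrases = []
    next_free = 0  # words before this index were used by an earlier phrase
    for pos, word in enumerate(words):
        if pos < next_free or crude_stem(word) not in stems:
            continue
        start = max(next_free, pos - width)      # never reuse earlier words
        end = min(len(words), pos + width + 1)
        phrase = words[start:pos] + ["[" + word + "]"] + words[pos + 1:end]
        phrases.append(" ".join(phrase))
        next_free = end
    return phrases
```

With the lines "I always wanted to find a way to sort old mail" and "and 
Xapian turned out to be the tool", searching for "wanting" gives the 
single phrase "I always [wanted] to find a way": the stemmed match is 
highlighted rather than thrown away, and a second match that falls 
inside an already-used window is skipped.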

I'm happy with the results and since I'm willing to put up with less 
than perfect results, it works fine.   It is better than the simple 
"sample" that omegascript generates.

Jim.