[Xapian-discuss] search result context.
Jim Lynch
jim at fayettedigital.com
Wed Jan 25 21:14:44 GMT 2006
I've always wanted to tackle the context problem, and a current project
finally gave me a reason to: I'm implementing a mail search
utility.
I should start at the beginning. I was laid off last year but had a
number of emails that I wanted to keep, not really work-related. I
copied the files to a CD before I left. I've also just moved from a
Windows to a Linux desktop and found that Thunderbird didn't really want
to use my old mail files, so I had to find a way to sort them out. Enter
Xapian.
To spare you most of the details, I have a Perl CGI script that presents
me with two search boxes, title and body. I've indexed the author too,
but haven't implemented that yet. I can also sort on author, date (and
reverse), relevance, title and collection. Collection is which of the
three databases I'm interested in: Linux, Windows, or Work. I also have
an AND checkbox.
I used scriptindex to generate the search databases and am using the XML
template, somewhat modified, to get the results. I form a call to
localhost/cgi-bin/omega in the Perl CGI script and use XML::Simple to
parse the results. Once I have the terms, I take each of the files that
the search indicated and find the lines in that file that contain at
least one of the terms. I save that line in addition to the next line
so I'm sure I have a few words on both sides. When I have a set of
lines, I break them into words at whitespace and start searching one
word at a time for a match. Then I take the previous 4 words and the
next 4 words and make a phrase from them. I don't reuse any words: if
they were in the previous phrase, they won't be in this one. Probably
not ideal, but it works. The biggest problem is stemming. If I search for
the word wanted, I get wanting, want, wants, etc. I could toss that
result away if it didn't really match, but I've elected to highlight the
stemmed word and go on in that case. Anyway, it took me maybe 4 hours to
complete the task, and while it isn't Google-perfect, it seems to work
for me. I'm running it on an AMD 2200 with 776108 kB of memory and a big
IDE disk. Nothing special, and it is not noticeably slower than before I
inserted the context generation. I don't have any large files to deal
with, though; mail messages don't usually run into megabytes.
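For anyone who wants to try the same idea, the context step described
above could be sketched roughly like this. It is Python rather than the
Perl of the actual script, and it is not the original code: the names
crude_stem and context_snippets, the suffix-stripping stand-in for a
real stemmer, and the *word* highlighting are all assumptions made for
the example.

```python
import re


def crude_stem(word):
    """Rough stand-in for a real stemmer (an assumption, not Xapian's).

    Lowercases, drops punctuation, and strips a few common suffixes.
    """
    w = re.sub(r"\W+", "", word.lower())
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def context_snippets(text, terms, width=4):
    """Return short phrases around each term match in text.

    Follows the steps in the post: keep matching lines plus the line
    after each, split on whitespace, and take `width` words either side
    of a match without reusing words from the previous phrase.
    """
    stems = {crude_stem(t) for t in terms}
    lines = text.splitlines()

    # Keep every line with a match, plus the following line.
    keep = set()
    for i, line in enumerate(lines):
        if any(crude_stem(w) in stems for w in line.split()):
            keep.add(i)
            if i + 1 < len(lines):
                keep.add(i + 1)

    # Flatten the kept lines into one word list. Simplification: kept
    # lines that were not adjacent in the file are simply run together.
    words = []
    for i in sorted(keep):
        words.extend(lines[i].split())

    snippets = []
    next_free = 0  # first word not already used by an earlier phrase
    for i, w in enumerate(words):
        if i < next_free or crude_stem(w) not in stems:
            continue
        start = max(next_free, i - width)
        end = min(len(words), i + width + 1)
        # Highlight the word as it appears, even if only the stem matched.
        phrase = words[start:i] + ["*" + w + "*"] + words[i + 1:end]
        snippets.append(" ".join(phrase))
        next_free = end
    return snippets
```

With the text "I wanted to keep the old mail files" followed by a second
line, and the search term "wanting", this returns the single phrase
"I *wanted* to keep the old". The stemmed match is highlighted rather
than thrown away, which is the same trade-off described above.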
I'm happy with the results, and since I'm willing to put up with less
than perfect output, it works fine. It is better than the simple
"sample" that omegascript generates.
Jim.