[Xapian-discuss] Indexing of email

Jim Lynch jim at fayettedigital.com
Sun Aug 20 17:39:59 BST 2006


James Aylett wrote:
> On Mon, Aug 21, 2006 at 12:01:30AM +1000, Michael Daly wrote:
>
>   
>> Does xapian index email as contained within an email (either linux
>> or windows) program? Please answer in regards to both the emails and
>> attachments.
>>     
>
> Xapian itself is a library for building applications that need
> indexing and search facilities. I have a script which will create
> omega-compatible indexes from mbox format email collections, which
> isn't really ready for prime time but I'm happy to send to anyone who
> is interested. It's in python, under the GPL. Let me know if you want
> a copy.
>
> James
>
>   
I also have a system to index email, but it's not even as far along as 
James' script.  Since I have multiple sources for my email, it's a bit 
more complex than it would need to be for a single mailbox.  It goes 
something like this:

I have two different sets of directories with Unix mailbox files on DVD. 
I ran hypermail (via a perl script to filter things) to convert them from 
mbox format to html.  This does two things: one, it provides me with a 
file/directory tree of one email per file that I can easily index, and 
two, it gives me a way to look at individual mail messages via a web 
interface.  Hypermail also adds attachments, but I filter out binary 
attachments so the files aren't so horribly big.
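
For readers without hypermail, the "one message per file, minus binary
attachments" step can be approximated with Python's standard library.
This is only a sketch under stated assumptions: it writes plain text
rather than html, it is not what hypermail or the perl filter does
internally, and the names split_mbox and message_text are made up for
illustration.

```python
import mailbox
from pathlib import Path


def message_text(msg):
    """Return the subject plus every text/* part; binary parts are skipped."""
    chunks = [str(msg.get("Subject", ""))]
    for part in msg.walk():
        # Multipart containers and non-text attachments are ignored here.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label: fall back to a lossless byte mapping.
            chunks.append(payload.decode("latin-1"))
    return "\n".join(chunks)


def split_mbox(mbox_path, out_dir):
    """Write each message of an mbox file to its own numbered .txt file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for n, msg in enumerate(mailbox.mbox(str(mbox_path))):
        target = out / f"{n:06d}.txt"
        target.write_text(message_text(msg), encoding="utf-8")
        written.append(target)
    return written
```

Calling split_mbox("Inbox", "out") on an mbox file named Inbox would
leave out/000000.txt, out/000001.txt and so on, one per message.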

Essentially I do the same thing for a set of Windows (Thunderbird) 
folders that I also have archived mail stored in.  And once a day I do 
the same for a set of Linux Thunderbird mail folders.  Hypermail reads 
both formats fine, since they are both in "mbox" format or close enough.

Then to index them into the Xapian database, I use find to enumerate 
all of the files in the html directories created by hypermail.  This 
list is fed into a perl script that looks for html files, doc files, pdf 
files, etc.  I then use an appropriate converter to convert these to 
text, read them in and generate input for scriptindex.  I collect a 
number of sets of data for each file and then run scriptindex. 
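
As a rough illustration of the "enumerate files, then generate input
for scriptindex" step, here is a small Python sketch (the script
described above is perl and is not shown here).  It relies on the
scriptindex input format of "field=value" lines, with continuation
lines starting with "=" and a blank line between records.  The
function names and the field names url and text are assumptions; the
real field names depend on whatever index script is passed to
scriptindex.

```python
import os


def find_files(root, extensions=(".html", ".doc", ".pdf")):
    """Yield paths under root with one of the given extensions, like find."""
    wanted = tuple(ext.lower() for ext in extensions)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(wanted):
                yield os.path.join(dirpath, name)


def scriptindex_record(fields):
    """Format one record: field=value lines, then a blank separator line."""
    lines = []
    for name, value in fields.items():
        first, *rest = str(value).splitlines() or [""]
        lines.append(f"{name}={first}")
        # Extra lines of a multi-line value are continuation lines.
        lines.extend(f"={line}" for line in rest)
    return "\n".join(lines) + "\n\n"
```

For example, scriptindex_record({"url": "/mail/0001.html", "text":
"line one\nline two"}) gives a three-line record followed by a blank
line; the converted text of each file would go where "text" is.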

I actually have 3 different Xapian databases, so I can selectively 
search the Irix set, the Windows set and the current Linux set.  The 
first two are static, but on a daily basis I run the current set.  Since 
it's not trivial to detect deleted mail messages, I just remove the 
whole html set and start over each night.  It takes a couple of hours, 
but since I'm sleeping and the computer isn't doing anything useful 
anyway, I don't care.
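
The "remove everything and start over" approach is simple enough to
sketch in a few lines of Python.  This is an assumed shape, not the
actual nightly job: rebuild_from_scratch and the build callback are
hypothetical names, and build stands in for the hypermail and
scriptindex runs.

```python
import shutil
from pathlib import Path


def rebuild_from_scratch(work_dir, build):
    """Delete work_dir entirely, recreate it, then run build(work_dir)."""
    work = Path(work_dir)
    if work.exists():
        # Deleted mail vanishes from the output because nothing is kept.
        shutil.rmtree(work)
    work.mkdir(parents=True)
    return build(work)
```

The trade-off is the one described above: no bookkeeping about which
messages were deleted, at the cost of redoing all the work each night.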

Someday I'll write up a simplified version of this and post it on the wiki.

Jim.


