[Xapian-discuss] omega and "text/x-mail" support

Jean-Francois Dockes jf at dockes.org
Tue Dec 16 22:06:56 GMT 2014


Emmanuel Garette writes:
 > On 15/12/2014 23:22, Olly Betts wrote:
 > > On Sat, Dec 13, 2014 at 08:32:58PM +0100, Emmanuel Garette wrote:
 > >> I would like to add "text/x-mail" support to omega. I'm using mhonarc to
 > >> export mail to HTML format and I'm using HTML parse to index mail
 > >> content (largely inspired by "application/vnd.ms-outlook" format).
 > >>
 > >> The problem is that files attached to the mail are not indexed at all.
 > >> I think it's not possible in the "index_file" function to index 2 files as
 > >> one document.
 > >>
 > >> I can't easily find a solution to my problem. I think I must split this
 > >> function to separate document creation from file indexing.
 > > I've done some work on indexing attachments and files inside archives
 > > (like tar and zip files), but I haven't merged it yet as it's not
 > > entirely satisfactory in various ways, most of which require some
 > > refactoring of omindex to address.
 > >
 > > The approach I took to attachments was to index them as separate
 > > documents - if I follow you correctly, you seem to be trying to treat
 > > them as part of a single document.  Is there a particular reason why
 > > you are taking that approach?
 > >
 > > I don't think my code is anywhere public currently, but I can rebase
 > > it onto current master and put it on a git branch if it's potentially
 > > useful to others in its current form.
 >
 > In my opinion, one file is a document. But maybe I'm wrong.
 > The problem is that we cannot construct the path (prefixed by U) in this case.
 > How do we deal with the path if an email could generate more than one document?
 > Something like "U/path/to/mail|Attached.pdf"? Or could we add a new prefix?

Maybe you could have a look at how Recoll recursively indexes subdocuments?
You could take this either as an inspiration or as an example of what not
to do...

In many cases, a file is not a document in the common sense of the term
(for example, an MS-Word document stored as an attachment to an e-mail
message inside a Thunderbird folder archived in a Zip file, as the web
site says :) ).
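
To make the subdocument idea concrete, here is a minimal sketch of one
way to walk a container recursively and give each subdocument its own
"U"-prefixed unique term, using the "|" separator suggested above. It
is not how Recoll or omega actually do it: Node, unique_term and walk
are hypothetical names made up for the illustration, and it assumes "|"
never appears in a file or attachment name.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical container or leaf: a file, a message, an attachment."""
    name: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def unique_term(path: str, inner: Tuple[str, ...]) -> str:
    """Build a 'U'-prefixed term from the file path plus the inner path."""
    # Assumption: '|' is used as the separator and is not escaped here.
    return "U" + "|".join((path,) + inner)


def walk(path: str, node: Node,
         inner: Tuple[str, ...] = ()) -> Iterator[Tuple[str, str]]:
    """Yield (unique term, text) for the node, then for each descendant."""
    yield unique_term(path, inner), node.text
    for child in node.children:
        yield from walk(path, child, inner + (child.name,))


# One mail file on disk containing one attachment: two index documents.
mail = Node("mail", "message body", [Node("Attached.pdf", "pdf text")])
docs = list(walk("/path/to/mail", mail))
```

Each nesting level just adds one more component to the term, so the
Zip / folder / message / attachment case falls out of the same
recursion. A real indexer would also need to escape the separator and
cope with very long terms.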

jf


