[Xapian-discuss] omega and "text/x-mail" support
Jean-Francois Dockes
jf at dockes.org
Tue Dec 16 22:06:56 GMT 2014
Emmanuel Garette writes:
> On 15/12/2014 23:22, Olly Betts wrote:
> > On Sat, Dec 13, 2014 at 08:32:58PM +0100, Emmanuel Garette wrote:
> >> I would like to add "text/x-mail" support to omega. I'm using mhonarc to
> >> export mail to HTML format and I'm using HTML parse to index mail
> >> content (largely inspired by "application/vnd.ms-outlook" format).
> >>
> >> The problem is that files attached to the mail are not indexed at all.
> >> I think it's not possible in the "index_file" function to index 2 files
> >> as one document.
> >>
> >> I can't easily find a solution to my problem. I think I must split this
> >> function to separate document creation from file indexing.
> > I've done some work on indexing attachments and files inside archives
> > (like tar and zip files), but I haven't merged it yet as it's not
> > entirely satisfactory in various ways, most of which require some
> > refactoring of omindex to address.
> >
> > The approach I took to attachments was to index them as separate
> > documents - if I follow you correctly, you seem to be trying to treat
> > them as part of a single document. Is there a particular reason why
> > you are taking that approach?
> >
> > I don't think my code is anywhere public currently, but I can rebase
> > it onto current master and put it on a git branch if it's potentially
> > useful to others in its current form.
>
> In my opinion, one file is a document. But maybe I'm wrong.
> The problem is that we cannot construct the path (prefixed by U) in this case.
> How do we deal with the path if an email could generate more than one document?
> Something like "U/path/to/mail|Attached.pdf"? Or we could add a new prefix?
Maybe you could have a look at how Recoll recursively indexes subdocuments?
You could take this either as an inspiration or as an example of what not
to do...
In many cases, a file is not a document in the common sense of the term
(think of an MS-Word document stored as an attachment to an e-mail message
inside a Thunderbird folder archived in a Zip file, as the web site says :) ).
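
To make the "one file, several documents" idea a bit more concrete, here is a
rough sketch in Python of how each subdocument could get its own identifier
built from the file path plus an internal path. It is only an illustration, not
what omega or Recoll actually does: the Node class, the unique_term() and walk()
helpers, and the choice of "|" as separator (borrowed from the
"U/path/to/mail|Attached.pdf" suggestion above) are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

# Assumption: "|" separates the file path from the internal path, as in the
# "U/path/to/mail|Attached.pdf" suggestion. Names containing "|" would need
# escaping, which this sketch does not attempt.
SEPARATOR = "|"


@dataclass
class Node:
    """Hypothetical container entry: a message, an attachment, an archive member."""
    name: str
    children: List["Node"] = field(default_factory=list)


def unique_term(path: str, ipath: Tuple[str, ...] = ()) -> str:
    """Build a 'U'-prefixed identifier from a file path and an internal path."""
    return "U" + SEPARATOR.join((path,) + ipath)


def walk(path: str, node: Node, ipath: Tuple[str, ...] = ()) -> Iterator[str]:
    """Yield an identifier for the node itself, then recurse into its subdocuments."""
    yield unique_term(path, ipath)
    for child in node.children:
        yield from walk(path, child, ipath + (child.name,))


# A mail with a PDF attachment and a zip attachment holding one image.
mail = Node("mail", [Node("Attached.pdf"), Node("photos.zip", [Node("a.jpg")])])
terms = list(walk("/path/to/mail", mail))
```

With this scheme the top-level file keeps its plain "U/path/to/mail" identifier,
and every attachment, however deeply nested, gets a distinct one derived from it.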
jf