[Xapian-discuss] omega and "text/x-mail" support

Olly Betts olly at survex.com
Tue Dec 16 22:46:23 GMT 2014


On Tue, Dec 16, 2014 at 10:04:47PM +0100, Emmanuel Garette wrote:
> On 15/12/2014 23:22, Olly Betts wrote:
> > The approach I took to attachments was to index them as separate
> > documents - if I follow you correctly, you seem to be trying to treat
> > them as part of a single document.  Is there a particular reason why
> > you are taking that approach?
> >
> > I don't think my code is anywhere public currently, but I can rebase
> > it onto current master and put it on a git branch if it's potentially
> > useful to others in its current form.
> In my opinion, one file is a document. But maybe I'm wrong.

For a zip or tar file, I don't think the contents are typically any more
closely related than files in the same directory are.

For email attachments, I think it's less clear and each approach is
arguably better in some cases and worse in others, but separating them
at least makes it easier to handle the meta-data.  E.g. say I want to
search for PDFs and collapse duplicates - both parts are hard to
satisfactorily achieve for the same PDF on disk and attached to an email
if the attachment was indexed as part of the email.

> The problem is that we cannot construct the path (prefixed by U) in this case.
> How do we deal with the path if an email can generate more than one document?
> Something like "U/path/to/mail|Attached.pdf"? Or we could add a new prefix?

For attachments, I'm appending "#" and a counter (U is a URL, so any "#"
actually in the filename gets encoded as "%23").  The filename of the
attachment is not necessarily unique, as an email can have two
attachments with the same filename.
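
As a rough illustration of that scheme (a Python sketch, not the actual
omindex code): the helper name attachment_url and the example path are
made up, and whether the counter starts at 0 or 1 is left open here.

```python
from urllib.parse import quote


def attachment_url(message_path: str, counter: int) -> str:
    """Hypothetical helper: URL for the counter-th attachment of a message."""
    # quote() leaves "/" alone but percent-encodes "#" as "%23", so a literal
    # "#" in the filename cannot be mistaken for the fragment separator.
    return quote(message_path) + "#" + str(counter)


# Two attachments with the same filename still get distinct URLs, because
# only the counter (not the attachment filename) goes into the fragment.
print(attachment_url("/mail/inbox/msg#1.eml", 1))
print(attachment_url("/mail/inbox/msg#1.eml", 2))
```

The document term would then be "U" followed by that URL.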

For archives, I just append "/" and then the path inside the archive.
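
A minimal sketch of the archive case, again in Python purely for
illustration; archive_member_url and the example paths are hypothetical
names, not anything from the real indexer.

```python
from urllib.parse import quote


def archive_member_url(archive_path: str, member_path: str) -> str:
    """Hypothetical helper: URL for a file stored inside an archive."""
    # Both halves are percent-encoded separately and joined with "/", so the
    # member looks like an ordinary file below the archive's own URL.
    return quote(archive_path) + "/" + quote(member_path)


print(archive_member_url("/files/docs.zip", "reports/q1 summary.pdf"))
```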

The issue then is how to link to such sub-files.

For zip files, the chosen URL format allows for dynamic unpacking by the
webserver - e.g. there used to be mod-unzip for apache, though it seems
to have died:

https://web.archive.org/web/20070208164202/http://nobits.org/mod-unzip/

In the absence of such cleverness, it falls back to returning the zip
file.
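
To make the idea concrete, here is a small Python sketch of what such
server-side unpacking could look like.  It is not how mod-unzip worked
(I have not checked), and split_archive_path and serve are invented
names; it only shows that the "archive/member" format carries enough
information to do it, with a fallback to the zip file itself.

```python
import zipfile


def split_archive_path(path: str):
    """Split "a/b.zip/inner/file" into ("a/b.zip", "inner/file").

    Returns (path, None) if no leading part of the path is a zip file.
    """
    parts = path.split("/")
    for i in range(1, len(parts)):
        candidate = "/".join(parts[:i])
        # is_zipfile() returns False for missing files and directories.
        if candidate and zipfile.is_zipfile(candidate):
            return candidate, "/".join(parts[i:])
    return path, None


def serve(path: str) -> bytes:
    """Return the bytes a server might send for the given filesystem path."""
    archive, member = split_archive_path(path)
    if member is not None:
        with zipfile.ZipFile(archive) as zf:
            try:
                return zf.read(member)
            except KeyError:
                pass  # no such member: fall back to the whole archive
    with open(archive, "rb") as f:
        return f.read()
```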

For emails, I'm currently trimming off the fragment in the template and
just linking to the message, but some server-side cleverness is
possible here too.
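
The trimming itself is done in the Omega template; the Python below is
only an equivalent sketch of the same operation, with message_link as a
made-up name.

```python
from urllib.parse import urldefrag


def message_link(url: str) -> str:
    """Hypothetical helper: drop the "#counter" part, keep the message URL."""
    # urldefrag() splits at the fragment separator; an encoded "%23" inside
    # the filename is not a separator and so is left untouched.
    return urldefrag(url).url


print(message_link("/mail/inbox/msg%231.eml#2"))
```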

> I'm interested in your work on indexing archives, to understand how you
> expect to build the path.

I'll try to rebase my changes and put them on a branch, but the above
should give you an idea.

Cheers,
    Olly


