[Xapian-discuss] Tika 0.8 failure rates
olly at survex.com
Thu Oct 6 06:30:44 BST 2011
On Wed, Oct 05, 2011 at 04:23:23PM +0100, James Aylett wrote:
> On 5 Oct 2011, at 15:38, Olly Betts wrote:
> > By default, omindex currently uses a list of extension->MIME
> > content-type mappings, and only consults the magic library for
> > extensions it doesn't know. So any file with a .doc extension will be
> > considered as application/msword (unless you run omindex with
> > '--mime-type=doc:').
> > This is a bit dubious as it's pretty common to find files with a .doc
> > extension which are actually RTF - that mechanism comes from before we
> > had libmagic support. I think it is worth keeping as libmagic doesn't
> > correctly identify every filetype, but we should probably trim the
> > default list a bit.
> Would it make sense to have a mode where libmagic is tried first, and
> if it fails to provide anything we can use we fall back to the
> internal table? We could configure it with something illegal at the
> start of a MIME type, such as '+'.
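To make the two lookup orders concrete, here is a minimal sketch in Python. It is not omindex's actual code: `EXTENSION_TABLE`, `extension_first`, `magic_first` and the `sniff` callback (a stand-in for a libmagic query that returns None when it has no answer) are all hypothetical names used only for illustration.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in for the built-in extension table; not the real list.
EXTENSION_TABLE: Dict[str, str] = {
    "doc": "application/msword",
    "pdf": "application/pdf",
}

# A sniffer takes a path and returns a MIME type, or None if it can't tell.
# In a real indexer this is where libmagic would be consulted.
Sniffer = Callable[[str], Optional[str]]


def extension_of(path: str) -> str:
    """Return the lower-cased extension of the final path component, or ''."""
    name = path.rsplit("/", 1)[-1]
    if "." not in name:
        return ""
    return name.rsplit(".", 1)[-1].lower()


def extension_first(path: str, sniff: Sniffer) -> Optional[str]:
    """Order described as the current default: consult the table first,
    and only ask the sniffer about extensions the table doesn't know."""
    mime = EXTENSION_TABLE.get(extension_of(path))
    return mime if mime is not None else sniff(path)


def magic_first(path: str, sniff: Sniffer) -> Optional[str]:
    """Suggested alternative: ask the sniffer first, and fall back to the
    table only when the sniffer has no answer."""
    mime = sniff(path)
    return mime if mime is not None else EXTENSION_TABLE.get(extension_of(path))
```

With a sniffer that reports "text/rtf" for an RTF file named report.doc, `extension_first` still returns "application/msword", while `magic_first` returns "text/rtf". The two orders only differ for extensions that are in the table.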
I suppose the question is whether there are situations where libmagic says
it doesn't know, and the extension tells us the type, but not reliably
enough that we would want to just trust the extension. If there aren't
situations where it would really help, it's just complicating the code
and the mental model the user needs to build for no reason.
(If the extension is reliable, then we can just use it as we do now,
and if libmagic gives a wrong answer we wouldn't get to such a
It's definitely useful to have a list of "trustworthy" extensions which
is checked first, as there are cases where libmagic thinks it knows the
answer but is just wrong. The problem is usually due to a rule which is
considered before the correct one not being specific enough, for
This sort of bug seems to be depressingly common, mostly because a
lot of file formats lack reliable magic sequences.
For .doc at least, I think the best approach is to just ask libmagic.
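As a rough illustration of why looking at the content helps for .doc, the sketch below checks the first bytes of a file rather than its name. It is a hand-rolled stand-in, not libmagic and not omindex code; `sniff_doc` is a hypothetical name, and returning "text/rtf" (rather than "application/rtf") is simply the choice made for this sketch. The two signatures themselves are well known: RTF files start with `{\rtf`, and binary Word documents are OLE2 compound files.

```python
from typing import Optional

# Header of an OLE2 compound file. Binary Word documents use this container,
# but so do other formats (e.g. old Excel and PowerPoint files), so on its
# own it is not specific enough to identify Word.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")

# RTF documents begin with this literal byte sequence.
RTF_MAGIC = b"{\\rtf"


def sniff_doc(data: bytes) -> Optional[str]:
    """Guess the type of a file that carries a .doc extension from its
    leading bytes. Returns None when neither signature matches."""
    if data.startswith(RTF_MAGIC):
        return "text/rtf"
    if data.startswith(OLE2_MAGIC):
        # Only a reasonable guess because the caller already knows the
        # extension is .doc; the header alone doesn't distinguish Word
        # from other OLE2-based formats.
        return "application/msword"
    return None
```

The comment on the OLE2 branch is the same kind of problem as the not-specific-enough rules mentioned above: a signature shared by several formats can't settle the type on its own.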