[Xapian-discuss] Tika 0.8 failure rates

James Aylett james-xapian at tartarus.org
Wed Oct 5 16:23:23 BST 2011


On 5 Oct 2011, at 15:38, Olly Betts wrote:

> By default, omindex currently uses a list of extension->MIME
> content-type mappings, and only consults the magic library for
> extensions it doesn't know.  So any file with a .doc extension will be
> considered as application/msword (unless you run omindex with
> '--mime-type=doc:').
> 
> This is a bit dubious as it's pretty common to find files with a .doc
> extension which are actually RTF - that mechanism comes from before we
> had libmagic support.  I think it is worth keeping as libmagic doesn't
> correctly identify every filetype, but we should probably trim the
> default list a bit.


Would it make sense to have a mode where libmagic is tried first, and we fall back to the internal table only if libmagic fails to provide anything we can use? We could select that mode with a character that is illegal at the start of a MIME type, such as '+'.
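
To make the two lookup orders concrete, here is a rough Python sketch. It is not omindex code: the table contents, the `detect_mime` function, the `magic_first` flag (standing in for whatever the '+' marker would switch on) and the set of "unusable" magic results are all assumptions made for illustration. The magic lookup is passed in as a plain callable so the sketch does not depend on any particular libmagic binding.

```python
import os
from typing import Callable, Mapping, Optional

# Hypothetical stand-in for the built-in extension -> MIME type table.
EXTENSION_TABLE = {
    "doc": "application/msword",
    "txt": "text/plain",
}

# Assumption: magic results we treat as "nothing we can use".
UNUSABLE = {"", "application/octet-stream"}


def detect_mime(
    path: str,
    magic_lookup: Callable[[str], Optional[str]],
    table: Mapping[str, str] = EXTENSION_TABLE,
    magic_first: bool = False,
) -> Optional[str]:
    """Return a MIME type for path, or None if neither source knows it."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    from_table = table.get(ext)

    # Current default as described above: a known extension wins outright.
    if not magic_first and from_table:
        return from_table

    # Either the extension is unknown, or we are in magic-first mode.
    from_magic = magic_lookup(path)
    if from_magic and from_magic not in UNUSABLE:
        return from_magic

    # Magic gave nothing usable: fall back to the table (may be None).
    return from_table


if __name__ == "__main__":
    # Fake magic lookup that "sniffs" an RTF file behind a .doc name.
    def sniff_rtf(_path: str) -> str:
        return "text/rtf"

    print(detect_mime("report.doc", sniff_rtf))
    print(detect_mime("report.doc", sniff_rtf, magic_first=True))
```

With the default order the mislabelled file is reported as application/msword; with `magic_first=True` the sniffed type is used, and the table only comes into play when the magic lookup returns nothing usable.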

J

-- 
 James Aylett
 talktorex.co.uk - xapian.org - devfort.com



