[Xapian-discuss] Tika 0.8 failure rates
olly at survex.com
Wed Oct 5 15:38:22 BST 2011
On Thu, Sep 29, 2011 at 04:31:43PM +0530, Charles wrote:
> On 01/09/11 18:51, Olly Betts wrote:
> > It would be interesting to know how these compare with the failure rates
> > for other filter programs on the same set of documents.
> > Without anything to compare these to, it's hard to know if they're good
> > or bad. For example, perhaps all those 345 failed .doc files are
> > "readme.doc" and actually plain text. Or perhaps they are all valid and
> > would be read fine by a different filter.
> Sorry for delay; pressure of higher priorities including having typhoid
> The omindex command is run with MIME types. For .doc files the option is:
> --filter "application/msword:java -jar $tika_jar --text
> so it seems unlikely that the failed .doc files are anything other than
> Word files.
By default, omindex currently uses a list of extension->MIME
content-type mappings, and only consults the magic library for
extensions it doesn't know. So any file with a .doc extension will be
considered as application/msword (unless you run omindex with
This is a bit dubious, as it's pretty common to find files with a .doc
extension which are actually RTF. The extension list dates from before we
had libmagic support. I think it is worth keeping, as libmagic doesn't
correctly identify every filetype, but we should probably trim the
default list a bit.
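
To illustrate the difference the lookup order makes, here is a rough
Python sketch. It is not omindex's actual code: the extension table, the
function names and the tiny sniff_type() stand-in for libmagic are all
made up for the example, and mapping the OLE2 signature straight to
application/msword is a simplifying assumption (other Office formats
share that container). Only the RTF and OLE2 leading bytes are real.

```python
import os

# Hypothetical, trimmed-down extension table; not omindex's real list.
EXTENSION_TYPES = {
    ".doc": "application/msword",
    ".rtf": "application/rtf",
    ".txt": "text/plain",
}

OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # OLE2 compound file signature
RTF_MAGIC = b"{\\rtf"                            # RTF files start with this


def sniff_type(head):
    """Stand-in for a libmagic lookup: guess a type from the leading bytes."""
    if head.startswith(RTF_MAGIC):
        return "application/rtf"
    if head.startswith(OLE2_MAGIC):
        # Assumption for the sketch: treat any OLE2 file as a Word document.
        return "application/msword"
    return None


def extension_first(filename, head):
    """Extension table wins; content is consulted only for unknown extensions."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_TYPES.get(ext) or sniff_type(head)


def content_first(filename, head):
    """Alternative ordering: trust the content, fall back to the extension."""
    ext = os.path.splitext(filename)[1].lower()
    return sniff_type(head) or EXTENSION_TYPES.get(ext)


# An RTF file saved with a .doc extension.
rtf_head = b"{\\rtf1\\ansi Hello}"
print(extension_first("report.doc", rtf_head))  # application/msword
print(content_first("report.doc", rtf_head))    # application/rtf
```

With the extension checked first, the RTF file is handed to the Word
filter, which is one way a valid document can end up counted as a
failed .doc file.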
> I plan to enhance the bash script that runs omindex, making the filters
> configurable. When that is done I plan to try a variety of filters and
> compare results. It may be that various filters fail on a different
> selection of files so this may give better coverage in the index.
> Would you like a few of the failing .doc files to investigate?
Feel free to send a few, though I'm crazily busy at the moment so might
not manage to investigate for a while. If they're OK to make public
feel free to attach them to a ticket in trac which will allow others to