[Xapian-discuss] Tika 0.8 failure rates

Olly Betts olly at survex.com
Thu Sep 1 14:21:15 BST 2011


On Tue, Aug 09, 2011 at 09:14:20PM +0530, Charles wrote:
> FYI, here is a list of apparent Tika 0.8 conversion failures when run
> from Xapian's omindex on a Debian 6 Squeeze 64-bit system with 4 GB memory:
> 
>  doc files: tried: 10268, failed: 345  3.35%
> docx files: tried:   248, failed:   0
>  odp files: tried:     7, failed:   0
>  ods files: tried:    71, failed:   0
>  odt files: tried:   136, failed:   0
>  pdf files: tried:  3888, failed: 150  3.85%
>  pps files: tried:    29, failed:   3 10.34%
> ppsx files: tried:    12, failed:   0
>  ppt files: tried:   331, failed:   0
> pptx files: tried:    24, failed:   0
>  rtf files: tried:   698, failed:   1   .14%
>  xls files: tried:  3339, failed:   2   .05%
> xlsx files: tried:    63, failed:   0
> 
> The statistics were generated by searching omindex output for
> .$ext" failed
> where $ext was each of the listed extensions in turn.
> 
> More information can be supplied on request.

It would be interesting to know how these compare with the failure rates
for other filter programs on the same set of documents.

Without anything to compare these to, it's hard to know if they're good
or bad.  For example, perhaps all those 345 failed .doc files are
"readme.doc" and actually plain text.  Or perhaps they are all valid and
would be read fine by a different filter.
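
One way to make that comparison is to run each filter over the same files, save the omindex output from each run, and tally failures per extension. What follows is only a rough sketch. It assumes a failure line contains the `.$ext" failed` pattern quoted above; the actual omindex wording may differ. The "tried" counts have to be supplied separately, for instance from a file listing. The names count_failures, failure_rate and compare are made up for illustration.

```python
import re
from collections import Counter

# Assumed shape of a failure line: it contains `.<ext>" failed`, the
# pattern quoted in the original message. Real omindex output may differ.
FAILED = re.compile(r'\.([A-Za-z0-9]+)" failed')


def count_failures(lines):
    """Count failure lines per lower-cased file extension."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


def failure_rate(failed, tried):
    """Failure rate as a percentage; 0.0 when nothing was tried."""
    return 100.0 * failed / tried if tried else 0.0


def compare(log_a, log_b, tried):
    """Per-extension failure rates (percent) for two filters.

    log_a and log_b are iterables of output lines from two runs over the
    same document set; tried maps extension -> number of files attempted.
    """
    failed_a = count_failures(log_a)
    failed_b = count_failures(log_b)
    return {
        ext: (failure_rate(failed_a[ext], n), failure_rate(failed_b[ext], n))
        for ext, n in tried.items()
    }
```

Calling compare() with the two saved logs and a dict of per-extension file counts gives a pair of percentages per extension, which would show whether the files one filter rejects are also rejected by the other.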

Cheers,
    Olly


