[Xapian-discuss] Tika 0.8 failure rates

Charles xapian at catcons.co.uk
Thu Sep 29 12:01:43 BST 2011


On 01/09/11 18:51, Olly Betts wrote:
> On Tue, Aug 09, 2011 at 09:14:20PM +0530, Charles wrote:
>> FYI, here is a list of apparent Tika 0.8 conversion failures when run
>> from Xapian's omindex on a Debian 6 Squeeze 64-bit system with 4 GB memory:
>>
>>  doc files: tried: 10268, failed: 345  3.35%
>> docx files: tried:   248, failed:   0
>>  odp files: tried:     7, failed:   0
>>  ods files: tried:    71, failed:   0
>>  odt files: tried:   136, failed:   0
>>  pdf files: tried:  3888, failed: 150  3.85%
>>  pps files: tried:    29, failed:   3 10.34%
>> ppsx files: tried:    12, failed:   0
>>  ppt files: tried:   331, failed:   0
>> pptx files: tried:    24, failed:   0
>>  rtf files: tried:   698, failed:   1  0.14%
>>  xls files: tried:  3339, failed:   2  0.05%
>> xlsx files: tried:    63, failed:   0
>>
>> The statistics were generated by searching omindex output for
>> .$ext" failed
>> where $ext was each of the listed extensions in turn.
>>
>> More information can be supplied on request.
> 
> It would be interesting to know how these compare with the failure rates
> for other filter programs on the same set of documents.
> 
> Without anything to compare these to, it's hard to know if they're good
> or bad.  For example, perhaps all those 345 failed .doc files are
> "readme.doc" and actually plain text.  Or perhaps they are all valid and
> would be read fine by a different filter.
> 
> Cheers,
>     Olly

Hello Olly :-)

Sorry for the delay; pressure of higher priorities, including having
typhoid fever.

omindex is run with a filter per MIME type.  For .doc files the option is:
--filter "application/msword:java -jar $tika_jar --text"
so it seems unlikely that the failed .doc files are anything other than
Word files.
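
The MIME mapping only says which filter was chosen, not what a file
actually contains.  A minimal sketch of a direct check follows, assuming
the failed files are legacy Word documents, which are OLE2 compound files
starting with the bytes D0 CF 11 E0 A1 B1 1A E1.  The name is_ole2 is a
hypothetical helper, not part of omindex or Tika.

```shell
# Sketch: succeed if FILE starts with the OLE2 compound-file signature
# (D0 CF 11 E0 A1 B1 1A E1) used by legacy Word .doc files.
# is_ole2 is a hypothetical helper name, not part of omindex or Tika.
is_ole2() {
    # Dump the first 8 bytes as hex and strip od's spacing.
    sig=$(head -c 8 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "d0cf11e0a1b11ae1" ]
}

# Usage (the path is a placeholder):
#   is_ole2 /path/to/failed.doc && echo "looks like OLE2" || echo "not OLE2"
```

A file with a .doc name that fails this test may be RTF or plain text
saved under a Word extension.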

I plan to enhance the bash script that runs omindex, making the filters
configurable.  When that is done I plan to try a variety of filters and
compare results.  Different filters may fail on different sets of files,
so using more than one may give better coverage in the index.
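
One possible shape for the configurable filters, as a sketch only: keep
one "mimetype:command" line per filter in a plain text file and turn each
line into a --filter option.  The file name filters.conf, the function
name run_with_filters and the paths in the usage comment are assumptions
made up for illustration.

```shell
# run_with_filters CONFIG CMD [ARGS...]
# Sketch: read "mimetype:command" lines from CONFIG (blank lines and
# lines starting with # are skipped) and run CMD with one --filter option
# per line, followed by ARGS.  All names here are hypothetical.
run_with_filters() {
    config=$1
    cmd=$2
    shift 2
    n=$#                          # number of caller-supplied ARGS

    # Append one --filter option per config line after ARGS.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            ''|'#'*) continue ;;
        esac
        set -- "$@" --filter "$line"
    done < "$config"

    # Rotate the original ARGS to the end so the filters come first.
    while [ "$n" -gt 0 ]; do
        set -- "$@" "$1"
        shift
        n=$((n - 1))
    done

    "$cmd" "$@"
}

# Usage (paths are placeholders):
#   run_with_filters filters.conf omindex --db /path/to/db --url / /path/to/docs
```

Because the commands are read from a file, shell variables such as
$tika_jar are not expanded; each line would need the full path written
out.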

Would you like a few of the failing .doc files to investigate?
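
To pick out the failures for one extension, the search described in the
earlier message can be wrapped in two small functions.  This is a sketch
that assumes the omindex output was saved to a file and that each failure
line contains the text .$ext" failed; the function names and the log file
name are hypothetical.

```shell
# Sketch: search saved omindex output for failures of one extension.
# Assumes each failure line contains the literal text  .EXT" failed
# failed_lines and count_failed are hypothetical helper names.

# failed_lines EXT LOG: print the matching lines.
failed_lines() {
    grep -F -- ".$1\" failed" "$2"
}

# count_failed EXT LOG: print how many lines match.
count_failed() {
    grep -c -F -- ".$1\" failed" "$2"
}

# Usage (omindex.log is a placeholder name):
#   count_failed doc omindex.log
#   failed_lines doc omindex.log | head
```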

Best

Charles


