[Xapian-discuss] Tika 0.8 failure rates
Charles
xapian at catcons.co.uk
Thu Sep 29 12:01:43 BST 2011
On 01/09/11 18:51, Olly Betts wrote:
> On Tue, Aug 09, 2011 at 09:14:20PM +0530, Charles wrote:
>> FYI, here is a list of apparent Tika 0.8 conversion failures when run
>> from Xapian's omindex on a Debian 6 Squeeze 64-bit system with 4 GB memory:
>>
>> doc files: tried: 10268, failed: 345 3.35%
>> docx files: tried: 248, failed: 0
>> odp files: tried: 7, failed: 0
>> ods files: tried: 71, failed: 0
>> odt files: tried: 136, failed: 0
>> pdf files: tried: 3888, failed: 150 3.85%
>> pps files: tried: 29, failed: 3 10.34%
>> ppsx files: tried: 12, failed: 0
>> ppt files: tried: 331, failed: 0
>> pptx files: tried: 24, failed: 0
>> rtf files: tried: 698, failed: 1 0.14%
>> xls files: tried: 3339, failed: 2 0.05%
>> xlsx files: tried: 63, failed: 0
>>
>> The statistics were generated by searching omindex output for
>> .$ext" failed
>> where $ext was each of the listed extensions in turn.
>>
>> More information can be supplied on request.
>
> It would be interesting to know how these compare with the failure rates
> for other filter programs on the same set of documents.
>
> Without anything to compare these to, it's hard to know if they're good
> or bad. For example, perhaps all those 345 failed .doc files are
> "readme.doc" and actually plain text. Or perhaps they are all valid and
> would be read fine by a different filter.
>
> Cheers,
> Olly
Hello Olly :-)
Sorry for the delay; higher priorities got in the way, including a bout of
typhoid fever.
The omindex command is run with MIME types. For .doc files the option is:
--filter "application/msword:java -jar $tika_jar --text"
so it seems unlikely that the failed .doc files are anything other than
Word files.
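For reference, the per-extension tally quoted above could be reproduced with
something like the following. This is only a minimal sketch: the function
name count_failed and the log file name omindex.log are made up for
illustration, and it assumes omindex's output was saved to a file in which
each failure line contains the literal text .$ext" failed.

```shell
#!/bin/sh
# Count the lines in a saved omindex log that report a failure for one
# extension. Assumption: a failure line contains the literal text
#   .EXT" failed
# as described in the original post; the rest of the line is not inspected.
count_failed() {
    ext=$1
    log=$2
    # -F treats the pattern as a fixed string, so the leading dot is literal.
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    grep -c -F -e ".$ext\" failed" "$log" || true
}

# Example use (omindex.log is a hypothetical file name):
#   for ext in doc docx pdf pps rtf xls; do
#       printf '%s files: failed: %s\n' "$ext" "$(count_failed "$ext" omindex.log)"
#   done
```

Because the pattern includes the closing quote, a search for doc does not
also count docx files.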
I plan to enhance the bash script that runs omindex, making the filters
configurable. When that is done I plan to try a variety of filters and
compare results. Different filters may well fail on different sets of
files, so using more than one may give better coverage in the index.
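One way such a comparison might be done, once logs from two runs exist, is to
list the files that failed under both filters. The sketch below is
hypothetical: the function names and log arguments are invented for
illustration, and it assumes each failure line holds the file path in double
quotes immediately followed by the word failed (the actual omindex wording
may differ).

```shell
#!/bin/sh
# List the paths of files with extension $1 that failed, from omindex log $2.
# Assumption: a failure line looks like  ..."/some/path.EXT" failed...
# Paths that themselves contain a double quote are not handled.
failed_files() {
    sed -n 's/.*"\([^"]*\.'"$1"'\)" failed.*/\1/p' "$2" | sort -u
}

# Print the files that failed in both logs ($2 and $3) for extension $1.
# These are the files that neither filter managed to convert.
failed_under_both() {
    ext=$1
    a=$(mktemp)
    b=$(mktemp)
    failed_files "$ext" "$2" > "$a"
    failed_files "$ext" "$3" > "$b"
    # comm -12 keeps only the lines common to both sorted inputs.
    comm -12 "$a" "$b"
    rm -f "$a" "$b"
}
```

If the list printed for two filters is much shorter than either filter's own
failure list, falling back from one filter to the other would index more
documents.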
Would you like a few of the failing .doc files to investigate?
Best
Charles