[Xapian-discuss] tiff / image pdf filter

Fri Mar 20 03:11:18 GMT 2009

Tesseract is better than gocr.
At the time I had trouble getting gocr to understand multipage TIFFs, 
but tesseract seems to handle them fine.
There's another project called OCRopus that works with tesseract.

http://sites.google.com/site/ocropus/

"OCRopus(tm) is a state-of-the-art document analysis and OCR system, 
featuring pluggable layout analysis, pluggable character recognition, 
statistical natural language modeling, and multi-lingual capabilities.

The system is being developed with the generous support from Google and 
other organizations; the primary developers are at the IUPR Research 
Group at the DFKI Research Center."

They claim that OCRopus can attain error rates similar to those of 
ABBYY FineReader and OmniPage.

Unless you proofread the documents, I guess there is a good chance there 
will be some garbage data. Would you say that it'd be detrimental to the 
index and should be left out?
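
If leaving it out turns out to be the answer, a rough sketch of one way to 
drop the worst OCR junk before indexing is below. This is only an 
illustration, not anything omindex does: the function names, the regular 
expression and the 0.6 threshold are all made up for the example, and the 
heuristic only knows about ASCII words and numbers, so it would need 
rethinking for other languages.

```python
import re
import string

# Assumption: a "plausible" token is an ASCII word (optionally with inner
# apostrophes or hyphens) or a number (optionally with . or , separators).
_PLAUSIBLE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+(?:[.,]\d+)*")


def plausible(token):
    """Return True if the token looks like a word or number once
    surrounding punctuation has been stripped."""
    core = token.strip(string.punctuation)
    return bool(core) and _PLAUSIBLE.fullmatch(core) is not None


def drop_junk_lines(text, min_ratio=0.6):
    """Keep only lines where at least min_ratio of the tokens look plausible.

    min_ratio is an arbitrary starting point, not a tuned value.
    """
    kept = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        good = sum(plausible(t) for t in tokens)
        if good / len(tokens) >= min_ratio:
            kept.append(line)
    return "\n".join(kept)
```

The text that survives could then be handed to the indexer in place of the 
raw OCR output; whether the lost recall is worth the cleaner index is 
still the open question.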

Olly Betts wrote:
> On Thu, Mar 19, 2009 at 03:36:47PM +1030, Frank J Bruzzaniti wrote:
>   
>> I've been experimenting using tesseract to OCR TIFFs with omega, just 
>> using the tesseract binary package from Ubuntu.
>>     
>
> Is tesseract better than gocr?  In the previous discussion of this I
> noted that gocr generates random junk from logos and graphics, and XML
> tags for barcodes, and the pipeline used didn't handle multi-page
> documents:
>
> http://thread.gmane.org/gmane.comp.search.xapian.general/6336/focus
>
>   
>> The one issue I find is that tesseract is sooo slow.
>>
>> One work around so ocr'ing doesn't hold up omindex would be to maintain 
>> a separate instance of omindex and a separate database of ocr'd data 
>> then allow them both to be searched via the "stub database" method.  I'd 
>> definitely want to use the last_mod patch here so I don't have to re-ocr.
>>
>> Does this sound reasonable?  If anyone has any better solutions I'd love 
>> to hear of them.  Once I've got it sorted I'll submit a patch. Maybe we 
>> could have a flag for omindex so it knows if it's designated just to OCR 
>> TIFFs. 
>>     
>
> You don't actually need new flags for this - you can just specify -M
> flags to disable subsets of mimetypes for each indexing run.
>
> That's quite fiddly for the "everything but tiffs" case, but perhaps the
> best way to deal with that is to add a "don't add the default mime
> mapping" option, rather than something very specific to OCRing tiffs.
>
>   
>> I guess we could also OCR image PDFs if they come back with no data from 
>> the regular pdf filter.
>>     
>
> We've been here before too - see the post linked to above.
>
> Cheers,
>     Olly
>
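
For the two-database setup discussed in the quoted text, here is a minimal 
sketch of how the pieces might fit together. It assumes the stub database 
file format of one "auto <path>" line per database, and that passing -M 
with an empty type (e.g. "-M tif:") removes omindex's mapping for that 
extension, which is how I read the suggestion above. All paths, the URL 
and the helper names are hypothetical, and the code only builds the 
command line rather than running omindex.

```python
from pathlib import Path


def stub_text(db_paths):
    """Return stub database contents: one 'auto <path>' line per database."""
    return "".join(f"auto {p}\n" for p in db_paths)


def write_stub(stub_path, db_paths):
    """Write the stub file so both databases can be searched as one."""
    Path(stub_path).write_text(stub_text(db_paths))


def omindex_args(db, url, directory, skip_exts=()):
    """Build an omindex command line.

    Each extension in skip_exts is passed as '-M ext:' (empty MIME type),
    assumed here to drop the mapping so those files are not indexed.
    """
    args = ["omindex", "--db", str(db), "--url", url]
    for ext in skip_exts:
        args += ["-M", f"{ext}:"]
    args.append(str(directory))
    return args


if __name__ == "__main__":
    # Hypothetical layout: a fast main index and a slow OCR-only index.
    main_db = "/var/lib/omega/data/main"
    ocr_db = "/var/lib/omega/data/ocr"
    print(stub_text([main_db, ocr_db]), end="")
    print(" ".join(omindex_args(main_db, "/docs", "/srv/docs",
                                skip_exts=("tif", "tiff"))))
```

The OCR-only run would still need the "everything but tiffs" case handled 
somehow, which is the fiddly part mentioned above.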