[Xapian-discuss] tiff / image pdf filter
Frank J Bruzzaniti
frank.bruzzaniti at gmail.com
Fri Mar 20 03:11:18 GMT 2009
Tesseract is better than gocr.
At the time I had trouble getting gocr to understand multipage TIFFs,
but tesseract seems to handle them fine.
There's another project called Ocropus that works with tesseract.
http://sites.google.com/site/ocropus/
"OCRopus(tm) is a state-of-the-art document analysis and OCR system,
featuring pluggable layout analysis, pluggable character recognition,
statistical natural language modeling, and multi-lingual capabilities.
The system is being developed with the generous support from Google and
other organizations; the primary developers are at the IUPR Research
Group at the DFKI Research Center."
They claim that Ocropus can attain error rates similar to those found in
ABBYY FineReader and OmniPage.
Unless you proofread the documents, I guess there is a good chance there
will be some garbage data. Would you say that it'd be detrimental to the
index and should be left out?
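
If the garbage did turn out to be a problem, one option would be to filter
the OCR output before it is indexed. Here is a minimal sketch in Python. The
heuristic, the thresholds and the function names are all my own assumptions
for illustration; nothing like this is provided by tesseract or omindex, and
it will also throw away some legitimate tokens.

```python
import re

# Split OCR output on whitespace.
_TOKEN_RE = re.compile(r"\S+")

def looks_like_word(token, max_len=25, min_alnum_ratio=0.7):
    """Hypothetical heuristic: keep tokens that are mostly letters or digits
    and are not implausibly long."""
    stripped = token.strip(".,;:!?\"'()[]")
    if not stripped or len(stripped) > max_len:
        return False
    alnum = sum(ch.isalnum() for ch in stripped)
    return alnum / len(stripped) >= min_alnum_ratio

def filter_ocr_text(text):
    """Return the text with tokens that fail the heuristic dropped."""
    return " ".join(t for t in _TOKEN_RE.findall(text) if looks_like_word(t))
```

For example, filter_ocr_text("Invoice total ~#|%^ due l|||l1 Monday") keeps
only the four real words.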
Olly Betts wrote:
> On Thu, Mar 19, 2009 at 03:36:47PM +1030, Frank J Bruzzaniti wrote:
>
>> I've been experimenting using tesseract to OCR TIFFs with omega, just
>> using the tesseract binary package from Ubuntu.
>>
>
> Is tesseract better than gocr? In the previous discussion of this I
> noted that gocr generates random junk from logos and graphics, and XML
> tags for barcodes, and the pipeline used didn't handle multi-page
> documents:
>
> http://thread.gmane.org/gmane.comp.search.xapian.general/6336/focus
>
>
>> The one issue I find is that tesseract is sooo slow.
>>
>> One workaround, so OCRing doesn't hold up omindex, would be to maintain
>> a separate instance of omindex and a separate database of OCR'd data,
>> then allow them both to be searched via the "stub database" method. I'd
>> definitely want to use the last_mod patch here so I don't have to re-OCR.
>>
>> Does this sound reasonable? If anyone has any better solutions I'd love
>> to hear of them. Once I've got it sorted I'll submit a patch. Maybe we
>> could have a flag for omindex so it knows if it's designated just to OCR
>> TIFFs.
>>
>
> You don't actually need new flags for this - you can just specify -M
> flags to disable subsets of mimetypes for each indexing run.
>
> That's quite fiddly for the "everything but tiffs", but perhaps the
> best way to deal with that is to add a "don't add the default mime
> mapping" option, rather than something very specific to OCRing tiffs.
>
>
>> I guess we could also OCR image PDFs if they come back with no data from
>> the regular pdf filter.
>>
>
> We've been here before too - see the post linked to above.
>
> Cheers,
> Olly
>
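
For reference, the two-database idea quoted above could be tied together
with a stub database file. Below is a rough sketch in Python that writes
one. It assumes the stub format of one "auto <path>" line per database; the
helper name and any paths passed in are hypothetical, so check the format
against the Xapian documentation for the version in use.

```python
import os

def write_stub(stub_path, db_paths):
    """Write a stub database file listing each database on its own line.

    Assumes the 'auto <path>' line format; returns the lines written.
    """
    lines = ["auto " + os.path.abspath(p) for p in db_paths]
    with open(stub_path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines
```

Pointing the search at the stub file would then cover both the main
database and the OCR'd one, while each is still updated by its own omindex
run.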