[Xapian-discuss] tiff / image pdf filter

Frank J Bruzzaniti frank.bruzzaniti at gmail.com
Thu Mar 19 05:06:47 GMT 2009


I've been experimenting with using tesseract to OCR TIFFs with omega, just 
using the tesseract binary package from Ubuntu.

The one issue I've found is that tesseract is really slow.

One workaround, so that OCRing doesn't hold up omindex, would be to maintain 
a separate instance of omindex and a separate database of OCR'd data, 
then allow both databases to be searched via the "stub database" method.  I'd 
definitely want to use the last_mod patch here so I don't have to re-OCR.
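
As a rough sketch of the stub database idea (not code from omega itself): 
a Xapian stub database is a small text file that lists the real databases, 
one per line, each as a backend type followed by a path.  The helper below 
only builds that text.  The function name and both paths are made up for 
illustration, and "auto" is assumed here so that Xapian detects the backend.

```python
def stub_contents(db_paths):
    """Build the text of a Xapian stub database file.

    Each line is "auto <path>".  The paths passed in are assumed to be
    existing Xapian databases, e.g. the regular one and the OCR one.
    """
    return "".join(f"auto {path}\n" for path in db_paths)


if __name__ == "__main__":
    # Hypothetical locations for the regular and the OCR database.
    text = stub_contents([
        "/var/lib/omega/data/regular",
        "/var/lib/omega/data/ocr",
    ])
    print(text, end="")
```

Saving that text as a file and pointing the search at it should let both 
databases be searched as one, assuming the paths are right for your setup.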

Does this sound reasonable?  If anyone has a better solution I'd love 
to hear it.  Once I've got it sorted I'll submit a patch.  Maybe we 
could have a flag for omindex so it knows that it's designated just to OCR 
TIFFs.

I guess we could also OCR image PDFs if they come back with no data from 
the regular PDF filter.  E.g. if you ran something like 
omindex --tiff --ipdf then it would only OCR TIFFs and image PDFs.  For 
PDFs it would first employ the regular PDF filter: if that returns data, 
skip the file; if it doesn't, OCR it.
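
The PDF check could look roughly like the sketch below.  This is only an 
illustration of the decision, not omindex code: ocr_if_needed and its two 
callable parameters are names I've made up, and pdf_text_layer assumes the 
pdftotext command is installed.  The OCR step is left as a parameter, since 
getting tesseract to read a PDF needs a conversion to images first.

```python
import subprocess


def pdf_text_layer(path):
    """Return the text pdftotext extracts (assumes pdftotext is installed)."""
    result = subprocess.run(
        ["pdftotext", path, "-"],  # "-" sends the text to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_if_needed(path, text_filter, ocr_filter):
    """Return OCR'd text for an image PDF, or None if the file is skipped.

    text_filter: the regular PDF filter, e.g. pdf_text_layer.
    ocr_filter:  whatever does the OCR (hypothetical, supplied by caller).
    """
    text = text_filter(path)
    # pdftotext output can be just whitespace and form feeds for a
    # scanned PDF, so strip before deciding the filter returned no data.
    if text.strip():
        return None  # regular filter got data: leave it to normal indexing
    return ocr_filter(path)
```

Passing the filters in as parameters keeps the slow OCR step swappable and 
makes the skip-or-OCR decision easy to try out without tesseract installed.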

Frank



