[Xapian-discuss] tiff / image pdf filter

Thu Mar 19 05:06:47 GMT 2009

I've been experimenting with using tesseract to OCR TIFFs with omega, just 
using the tesseract binary package from Ubuntu.

The one issue I find is that tesseract is sooo slow.

One workaround, so that OCR doesn't hold up omindex, would be to maintain 
a separate instance of omindex and a separate database of OCR'd data, 
then allow both to be searched via the "stub database" method.  I'd 
definitely want to use the last_mod patch here so I don't have to re-OCR 
files that haven't changed.
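
To make the stub database idea concrete, here is a rough sketch in Python 
that writes a stub file naming both databases.  A Xapian stub file is plain 
text with one "type path" entry per line, and "auto" lets Xapian detect the 
backend.  The database paths and the stub file name are made up for 
illustration, and the helper names are not part of Xapian or omega.

```python
from pathlib import Path


def stub_text(db_paths):
    """Return stub file contents: one 'auto <path>' line per database."""
    return "".join(f"auto {p}\n" for p in db_paths)


def write_stub(stub_path, db_paths):
    """Write a stub database file that combines the given databases."""
    Path(stub_path).write_text(stub_text(db_paths))


if __name__ == "__main__":
    # Hypothetical locations: the normal omindex database and the OCR-only one.
    databases = ["/var/lib/omega/data/main", "/var/lib/omega/data/ocr"]
    print(stub_text(databases), end="")
```

Searching the stub then searches both databases together, so the slow OCR 
run can update its own database without blocking the regular index.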

Does this sound reasonable?  If anyone has a better solution I'd love 
to hear it.  Once I've got it sorted I'll submit a patch.  Maybe we 
could have a flag for omindex so it knows it's designated just to OCR 
TIFFs.

I guess we could also OCR image PDFs if they come back with no data from 
the regular PDF filter.  E.g. if you run omindex --tiff --ipdf then it 
will only OCR TIFFs and image PDFs.  For PDFs it would run the regular 
PDF filter first: if that returns data, skip the file; if it doesn't, 
OCR it.
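
The "try the regular filter, fall back to OCR" decision could look roughly 
like the Python sketch below.  This is not omindex code.  It assumes 
pdftotext and tesseract are on the PATH, and that the image being OCR'd is 
in a format tesseract can read.  An image PDF would first need its pages 
rasterised, which is left to whatever OCR callable is passed in.  The 
function names are invented for the example.

```python
import subprocess
import tempfile
from pathlib import Path


def pdf_text(path):
    """Run pdftotext and return the text layer ('-' means write to stdout)."""
    result = subprocess.run(
        ["pdftotext", str(path), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_image(path):
    """Run tesseract on an image; it writes its result to <outputbase>.txt."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp) / "out"
        subprocess.run(
            ["tesseract", str(path), str(base)],
            capture_output=True, check=True,
        )
        return base.with_suffix(".txt").read_text()


def text_or_ocr(path, extract, ocr):
    """Return (text, used_ocr): use the normal filter unless it finds nothing."""
    text = extract(path)
    if text.strip():
        # The regular filter found text, so the OCR-only run can skip this file.
        return text, False
    # Nothing but whitespace came back: treat it as an image-only document.
    return ocr(path), True
```

Passing the two steps in as callables keeps the decision separate from the 
external tools, so the slow OCR step can be swapped or queued later.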

Frank