[Xapian-discuss] tiff / image pdf filter

Thu Mar 19 22:42:05 GMT 2009

On Thu, Mar 19, 2009 at 03:36:47PM +1030, Frank J Bruzzaniti wrote:
> I've been experimenting using tesseract to OCR tiff's with omega just 
> using the tesseract binary package from Ubuntu.

Is tesseract better than gocr?  In the previous discussion of this I
noted that gocr generates random junk from logos and graphics, and XML
tags for barcodes, and the pipeline used didn't handle multi-page
documents:

http://thread.gmane.org/gmane.comp.search.xapian.general/6336/focus

> The one issue I find is that tesseract is sooo slow.
> 
> One workaround so ocr'ing doesn't hold up omindex would be to maintain 
> a separate instance of omindex and a separate database of ocr'd data 
> then allow them both to be searched via the "stub database" method.  I'd 
> definitely want to use the last_mod patch here so I don't have to re-ocr.
> 
> Does this sound reasonable?  If anyone has any better solutions I'd love 
> to hear of them.  Once I've got it sorted I'll submit a patch. Maybe we 
> could have a flag for omindex so it knows if it's designated just to ocr 
> tiffs.

You don't actually need new flags for this - you can just specify -M
flags to disable subsets of mimetypes for each indexing run.

That's quite fiddly for the "everything but tiffs" case, but perhaps the
best way to deal with that is to add a "don't add the default mime
mapping" option, rather than something very specific to OCRing tiffs.
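
For the "stub database" part of the quoted proposal, here is a minimal
sketch of generating a stub file that lists both databases so they can be
searched together.  The paths and function names are hypothetical, and it
assumes the installed Xapian accepts the "auto" backend type in stub files
(one "type path" pair per line):

```python
from pathlib import Path


def stub_contents(db_paths):
    """Return the text of a stub database file listing each database.

    Assumes one "auto <path>" line per database; "auto" asks Xapian to
    detect the backend of the database at that path.
    """
    return "".join(f"auto {p}\n" for p in db_paths)


def write_stub(stub_path, db_paths):
    """Write the stub file to stub_path (hypothetical helper)."""
    Path(stub_path).write_text(stub_contents(db_paths))


if __name__ == "__main__":
    # Hypothetical locations: one database from the normal omindex run,
    # one from the separate, slower OCR-only run.
    print(stub_contents(["/var/lib/omega/data/main",
                         "/var/lib/omega/data/ocr"]), end="")
```

Each omindex run would then update only its own database, while searches
go through the stub.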

> I guess we could also ocr image pdf's if they come back with no data from 
> the regular pdf filter.

We've been here before too - see the post linked to above.

Cheers,
    Olly