[Xapian-discuss] Dealing with image PDF's
Frank John Bruzzaniti
frank.bruzzaniti at gmail.com
Thu Jul 31 15:27:30 BST 2008
Yeah, I was using gocr as a test, I guess; once ocropus and tesseract
merge I'd like to give that a go, as it performs much better.
I wanted to use the OCR as a last-ditch effort: better to index something
from an image PDF or TIFF than not to index it at all.
I'll have a play around with it. Thank you for your help and suggestions.
Olly Betts wrote:
> On Thu, Jul 31, 2008 at 04:09:39AM +0930, Frank Bruzzaniti wrote:
>
>> I was just playing around and added a bit of code to omindex.cc so I
>> could ocr tiff and tif with gocr which seems to work. Here's what it
>> looks like:
>>
>> // Tiff:
>> } else if (startswith(mimetype, "image/tif"))
>>
>
> Just test (mimetype == "image/tiff") instead -- image/tif is just incorrect.
>
>
>> {
>> // Inspired by http://mjr.towers.org.uk/comp/sxw2text
>>
>
> This comment is not relevant here.
>
>
>>     string safefile = shell_protect(file);
>>     string cmd = "tifftopnm " + safefile + " | gocr -f UTF8 -";
>>     try {
>>         dump = stdout_to_string(cmd);
>>     } catch (ReadError) {
>>         cout << "\"" << cmd << "\" failed - skipping\n";
>>         return;
>>     }
>>     // Tiff:End
>>
>
> Interesting idea! I tried it on the TIFF files I have here. The
> problems I noticed:
>
> * On the TIFF icons I have from various packages, I get random junk from
> the OCR software, which we don't really want to be indexing. I couldn't
> see an obvious option to tell gocr to "give up if there's nothing
> which looks like text". Logos and graphics on pages of text also lead
> to random junk so perhaps a filtering step to drop it would be better
> anyway.
>
> * On the multi-page scanned document I have, I only get the text from
> the first page. I guess that's tifftopnm, but it doesn't seem to have
> an option to do "all pages". Perhaps something else to do this
> conversion would be better?
>
> * It OCRed a barcode in my document, which is cute, but we don't really
> want to index the XML-like tag as plain text:
>
> <barcode type="39" chars="12" code="*N04456664M*" crc="E" error="0.049" />
>
>
>> I don't really understand all the code in omindex.cc but was wondering
>> if I could OCR when no text was returned while trying to process PDF's
>> as a way of dealing with image only PDF's.
>>
>> Here's the bit in omindex.cc that deals with pdf's:
>>
>> } else if (mimetype == "application/pdf") {
>>     string safefile = shell_protect(file);
>>     string cmd = "pdftotext -enc UTF-8 " + safefile + " -";
>>     try {
>>         dump = stdout_to_string(cmd);
>>     } catch (ReadError) {
>>         cout << "\"" << cmd << "\" failed - skipping\n";
>>         return;
>>     }
>>
>
> And then:
>
> if (dump.empty()) {
>     // Do the OCR thing...
> }
>
> Or, if you can get an "empty" dump which actually just has whitespace
> in it, then:
>
> if (dump.find_first_not_of(" \n\t") == string::npos) {
>     // Do the OCR thing...
> }
>
>
>> I wanted to change it so if nothing (or no strings) was returned from
>> "pdftotext -enc UTF-8 " + safefile + " -"; then run "pdftoppm " +
>> safefile + " | gocr -f UTF8 -";
>>
>
> pdftoppm seems to produce one ppm file per page, rather than output on
> stdout, so you'll need to extract to a temporary directory and then
> read files from it. See the PostScript handling code for how to work
> with a temporary directory.
>
>
>> P.S. I was able to write similar snippets of code to process docx and
>> xlsx; so far so good. If they test OK, should I post them somewhere or
>> email them to someone?
>>
>
> Creating a new trac ticket and attaching the patch is probably best.
>
> If you've not already done so, take a look at:
>
> http://trac.xapian.org/wiki/FAQ/OmegaNewFileFormat
>
> As that says, it's very helpful if you can update the documentation to
> cover the new format(s) and supply some sample files which we can
> redistribute for testing (my hope is to create an automated test suite
> for omindex).
>
> Cheers,
> Olly
>