[Xapian-discuss] Dealing with image PDF's
Frank John Bruzzaniti
frank.bruzzaniti at gmail.com
Thu Jul 31 15:27:30 BST 2008
Yeah, I was using gocr as a test, I guess; once ocropus and tesseract
merge I'd like to give that a go, as it performs much better.
I wanted to use the OCR as a last-ditch effort: better to index something
from an image PDF or TIFF than not to index it at all.
I'll have a play around with it. Thank you for your help and suggestions.
Olly Betts wrote:
> On Thu, Jul 31, 2008 at 04:09:39AM +0930, Frank Bruzzaniti wrote:
>
>> I was just playing around and added a bit of code to omindex.cc so I
>> could ocr tiff and tif with gocr which seems to work. Here's what it
>> looks like:
>>
>> // Tiff:
>> } else if (startswith(mimetype, "image/tif"))
>>
>
> Just test (mimetype == "image/tiff") instead -- image/tif is just incorrect.
>
>
>> {
>> // Inspired by http://mjr.towers.org.uk/comp/sxw2text
>>
>
> This comment is not relevant here.
>
>
>>     string safefile = shell_protect(file);
>>     string cmd = "tifftopnm " + safefile + " | gocr -f UTF8 -";
>>     try {
>>         dump = stdout_to_string(cmd);
>>     } catch (ReadError) {
>>         cout << "\"" << cmd << "\" failed - skipping\n";
>>         return;
>>     }
>>     // Tiff:End
>>
>
> Interesting idea! I tried it on the TIFF files I have here. The
> problems I noticed:
>
> * On the TIFF icons I have from various packages, I get random junk from
> the OCR software, which we don't really want to be indexing. I couldn't
> see an obvious option to tell gocr to "give up if there's nothing
> which looks like text". Logos and graphics on pages of text also lead
> to random junk so perhaps a filtering step to drop it would be better
> anyway.
>
> * On the multi-page scanned document I have, I only get the text from
> the first page. I guess that's tifftopnm, but it doesn't seem to have
> an option to do "all pages". Perhaps something else to do this
> conversion would be better?
>
> * It OCRed a barcode in my document, which is cute, but we don't really
> want to index the XML-like tag as plain text:
>
> <barcode type="39" chars="12" code="*N04456664M*" crc="E" error="0.049" />
>
>
>> I don't really understand all the code in omindex.cc but was wondering
>> if I could OCR when no text was returned while trying to process PDF's
>> as a way of dealing with image only PDF's.
>>
>> Here's the bit in omindex.cc that deals with pdf's:
>>
>> } else if (mimetype == "application/pdf") {
>>     string safefile = shell_protect(file);
>>     string cmd = "pdftotext -enc UTF-8 " + safefile + " -";
>>     try {
>>         dump = stdout_to_string(cmd);
>>     } catch (ReadError) {
>>         cout << "\"" << cmd << "\" failed - skipping\n";
>>         return;
>>     }
>>
>
> And then:
>
> if (dump.empty()) {
>     // Do the OCR thing...
> }
>
> Or, if you can get an "empty" dump which actually just has whitespace
> in it, then:
>
> if (dump.find_first_not_of(" \n\t") == string::npos) {
>     // Do the OCR thing...
> }
>
>
>> I wanted to change it so if nothing (or no strings) was returned from
>> "pdftotext -enc UTF-8 " + safefile + " -"; then run "pdftoppm " +
>> safefile + " | gocr -f UTF8 -";
>>
>
> pdftoppm seems to produce one ppm file per page, rather than output on
> stdout, so you'll need to extract to a temporary directory and then
> read files from it. See the PostScript handling code for how to work
> with a temporary directory.
>
>
>> P.S. I was able to write similar snippets of code to process docx and
>> xlsx; so far so good. If they test OK, should I post them somewhere or
>> email them to someone?
>>
>
> Creating a new trac ticket and attaching the patch is probably best.
>
> If you've not already done so, take a look at:
>
> http://trac.xapian.org/wiki/FAQ/OmegaNewFileFormat
>
> As that says, it's very helpful if you can update the documentation to
> cover the new format(s) and supply some sample files which we can
> redistribute for testing (my hope is to create an automated test suite
> for omindex).
>
> Cheers,
> Olly
>