[Xapian-discuss] [Xapian-devel] Dealing with image PDF's

Olly Betts olly at survex.com
Thu Jul 31 12:54:47 BST 2008


Folks, let's not send this to *both* the mailing lists.  Arguably it
isn't out of place on either, so I'm going to set replies to
xapian-discuss since that is probably mostly a superset and the thread
originated there.  Please send replies to other messages in this thread
to xapian-discuss only (I'm sending this to both lists to make sure
everyone sees this note!)

On Thu, Jul 31, 2008 at 09:55:15AM +0100, Richard Boulton wrote:
> Reini Urban wrote:
> > 2008/7/30 Frank Bruzzaniti <frank.bruzzaniti at gmail.com>:
> >>    string cmd = "tifftopnm " + safefile + " | gocr -f UTF8 -";
> > 
> > Can we finally please use configure checks for such weird helper apps,
> > to avoid runtime exceptions where the system clearly has no such app?
> > 
> > I once provided a huge patch to do that.
> > http://thread.gmane.org/gmane.comp.search.xapian.devel/783/
> 
> Perhaps the patch should go in a ticket; that way, we're less likely to 
> forget about it.

There's already this, which I think is from an earlier version of
Reini's patch which was pasted to the wiki:

http://trac.xapian.org/ticket/282

I recently updated it to apply to SVN trunk, stripping out the features
which had already gone into Omega, dropping random unused bits of code,
and cleaning it up a bit, but it's not really in shape for applying to
a release yet.

But indeed the best place to put a patch you want applied is the bug
tracker - just sending it to the mailing list or pasting it into the
wiki means we are more likely to overlook it.  By all means send stuff
to the list for discussion first if that seems best.

In this particular case, the patch was described by its author as "not
yet complete", and I replied with some comments which were never
responded to, so I'm not surprised it's languished.

In general, it's more helpful to send a patch with a particular purpose
too.  Patches which do several things, some of which are only partially
implemented, aren't going to just get applied and are more work to
respond to.  A simple patch which does something useful in a good way
can usually just be applied.

> > Applied to 1.0.5 it is attached. But there's much more in this patch
> > so some parts may be stripped. See ChangeLog.
> > TEXTCAT support for language and charset detection, cached virtual
> > directories (zip,msg,pst,...) to name a few. Works fine for me for two
> > years and I haven't touched it since 0.9.6.
> 
> Sounds useful.  However, I'm not sure that configure time is the right 
> place to check for the existence of helper apps.  In particular, quite 
> often omindex is installed from a pre-compiled package (for example, in 
> Debian), and the helper apps present at configure time need therefore 
> bear no relation to those present at runtime.

Indeed, as I noted at the time.

> Perhaps omindex could be improved to handle missing helper applications 
> - I've not actually looked at how it handles this recently, so I don't 
> know if there's actually a problem, but if there is, the correct fix 
> seems to me to be to handle missing helper applications gracefully, 
> rather than disable them at configure time.  Perhaps omindex would keep 
> a cache, during each run, of the helper applications which have been 
> found to be missing, so it would only attempt to run each one once.

It essentially already does.

The current behaviour (implemented in response to that patch I believe)
is that we detect when a filter has failed to run because it wasn't
installed and remove the mime-map entry that caused us to try to run it
(for that invocation of omindex).

So if we see a doc file, we'll try antiword.  If it's not installed,
we delete the mimetype entry for doc files and so skip any further doc
files encountered.  So we only try to run each non-present filter once
(actually, once per mimetype which is slightly less good, but could be
improved - the hard bit to deal with is when two programs are used in a
pipe and you don't know which failed).
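
As a rough sketch of that logic (this is not Omega's actual code: the
names run_filter, extract_text and mime_map are made up for illustration,
and it assumes a POSIX shell, which exits with status 127 when it can't
find the command):

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <sys/wait.h>

// Sketch only: all names here are hypothetical, not Omega's real code.
enum class FilterResult { Ok, Failed, NotInstalled };

// Run `cmd` through the shell and collect its standard output in `out`.
// Assumption: a POSIX shell, which exits with status 127 when it cannot
// find the command it was asked to run.
FilterResult run_filter(const std::string& cmd, std::string& out) {
    out.clear();
    FILE* fh = popen(cmd.c_str(), "r");
    if (!fh) return FilterResult::Failed;
    char buf[4096];
    std::size_t n;
    while ((n = fread(buf, 1, sizeof(buf), fh)) > 0) out.append(buf, n);
    int status = pclose(fh);
    if (status == -1 || !WIFEXITED(status)) return FilterResult::Failed;
    switch (WEXITSTATUS(status)) {
        case 0: return FilterResult::Ok;
        case 127: return FilterResult::NotInstalled;
        default: return FilterResult::Failed;
    }
}

// Extract text from `safefile` (assumed to be already shell-escaped) using
// the filter command mapped to `mimetype`.  If the filter turns out not to
// be installed, drop the mapping so later files of that type are skipped
// without trying to run the filter again.
bool extract_text(std::map<std::string, std::string>& mime_map,
                  const std::string& mimetype,
                  const std::string& safefile,
                  std::string& text) {
    auto it = mime_map.find(mimetype);
    if (it == mime_map.end()) return false;  // No filter (any more).
    FilterResult r = run_filter(it->second + " " + safefile, text);
    if (r == FilterResult::NotInstalled) mime_map.erase(it);
    return r == FilterResult::Ok;
}
```

Note that with a pipe such as the "tifftopnm ... | gocr ..." command quoted
above, the shell normally reports only the last command's exit status, so
a missing tifftopnm wouldn't show up as 127 here.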

Cheers,
    Olly


