[Xapian-discuss] Omindex Filters

Olly Betts olly at survex.com
Thu Sep 18 04:05:01 BST 2008


On Wed, Sep 17, 2008 at 10:11:28AM +0200, Jean-Francois Dockes wrote:
> Olly Betts writes:
>  > I think you need to at least consider the character set and format of
>  > the output (plain text and HTML are common), and possibly also filters
>  > which can only produce output to a file, not stdout.  Meta-data is
>  > another issue (look at the PDF handling for example).
> 
> For what it's worth, the way Recoll handles this is to have all external
> filters output HTML (using a wrapper script in most cases). Character set
> and meta data information is issued as usual in the head section.

I've looked at Recoll's filters, but I'm not sure I like the idea of
forcing text to be converted to HTML and back in the common case where
the external filter program produces plain text.

>  > It's true that some such issues can be handled to at least some extent
>  > with a wrapper script around the command, but then you're adding the
>  > overhead of forking several extra commands per file processed, which
>  > is better avoided.
> 
> One can't but agree with this. But the kind of document types which would
> use an external filter are either unusual or heavy-weight (the rest can
> stay in-process). Executing a few additional commands for these may prove
> not to be a major issue.

What may be unusual to you or me is likely to be common to someone else.
By the nature of document formats, people will tend to have a lot of
documents in the same format.

I had a crude stab at measuring the overhead - I took the rcldoc filter
from Recoll 1.3.3 (just because I happen to have that source tree
unpacked already) and timed it converting the same 300KB Word document
(a random scientific paper) 800 times like so:

time (for a in `seq 1 800` ; do recoll-1.3.3/filters/rcldoc BT_v5.doc >/dev/null;done)

With a warm cache, this gave these timings for 3 runs:

real	0m13.265s
user	0m12.201s
sys	0m2.380s

real	0m13.344s
user	0m12.125s
sys	0m2.396s

real	0m13.476s
user	0m12.233s
sys	0m2.420s

Repeat with just antiword, using the same options rcldoc calls it with:

time (for a in `seq 1 800` ; do antiword -t -i 1 -m UTF-8 BT_v5.doc >/dev/null;done)

real	0m9.396s
user	0m8.305s
sys	0m0.816s

real	0m9.477s
user	0m8.309s
sys	0m0.892s

real	0m9.443s
user	0m8.285s
sys	0m0.804s

Conclusion - rcldoc is about 42% slower, and I've not factored in the
extra time omindex would need to spend parsing the HTML.  Now I
appreciate that .doc isn't a trivial format to parse, so I think this
crude test is still indicative.  Also, I ran it on Linux, which has a
low process start-up overhead - on Cygwin this would be much worse.

I'm not trying to knock Recoll (or Estraier which seems to be where
these filters originated) here, just pointing out why I have
reservations about the approach.

>  > We also don't want to encourage hacky handling of temporary files as
>  > that's a route straight to security bugs via symlink attacks - an
>  > obvious but bad approach to handling output to a file is a wrapper
>  > script like this one:
>  > 
>  >     #!/bin/sh
>  >     foo2txt "$1" /tmp/$$.txt
>  >     cat /tmp/$$.txt
>  >     rm /tmp/$$.txt
> 
> You can do this without the shell too :)

You can, but if temporary file handling is implemented just once in
omindex, then we only need to get it right once.  It's also easier
to do portably in C/C++ than shell (since not all platforms have the
mktemp command).
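Done centrally, it might look something like this - a minimal sketch
using POSIX mkstemp(); the function name and path prefix are
illustrative, not actual omindex code:

```cpp
// Sketch only - illustrative names, not actual omindex code.
// mkstemp() creates and opens the file atomically (O_EXCL), so an
// attacker can't pre-plant a symlink at a predictable name the way
// they can with the "/tmp/$$.txt" wrapper-script approach.
#include <stdexcept>
#include <string>
#include <stdlib.h>  // mkstemp()
#include <unistd.h>  // close(), unlink()

// Create a unique temporary file securely and return its pathname.
std::string make_tmp_file() {
    std::string templ = "/tmp/omindexXXXXXX";
    int fd = mkstemp(&templ[0]);  // replaces the XXXXXX in place
    if (fd < 0)
        throw std::runtime_error("mkstemp failed");
    // We only need the name here: the external filter will be told to
    // write its output to this (already safely created) file.
    close(fd);
    return templ;
}
```

The point being that every filter command which can only write to a
file gets this one audited code path, rather than each wrapper script
reinventing it.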

I keep an eye on the Debian and Ubuntu security updates, and it seems
that the majority of symlink attack vulnerabilities are in external
helper scripts.  Possibly that's skewed by where people look for such
bugs though.

Cheers,
    Olly
