[Xapian-discuss] Omindex Filters

Jean-Francois Dockes jean-francois.dockes at wanadoo.fr
Wed Sep 17 09:11:28 BST 2008


Olly Betts writes:
 > I think you need to at least consider the character set and format of
 > the output (plain text and HTML are common), and possibly also filters
 > which can only produce output to a file, not stdout.  Meta-data is
 > another issue (look at the PDF handling for example).

For what it's worth, Recoll handles this by having all external filters
output HTML (through a wrapper script in most cases). Character set and
metadata information are emitted, as usual, in the head section.
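
To make the idea concrete, here is a rough sketch of what such a wrapper
could look like. It is not Recoll's actual filter code: the function names
text_to_html and html_escape are made up for illustration, and it assumes
the underlying converter already writes plain text in a known character set
to stdout.

```shell
#!/bin/sh
# Hypothetical helper: escape the three characters that matter inside HTML text.
html_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Hypothetical wrapper: read plain text on stdin, write an HTML document on
# stdout with the character set and title carried in the head section.
#   $1 = document title, $2 = character set of the incoming text
text_to_html() {
    title=$1
    charset=$2
    printf '<html><head>\n'
    printf '<meta http-equiv="Content-Type" content="text/html; charset=%s">\n' "$charset"
    printf '<title>%s</title>\n' "$(printf '%s' "$title" | html_escape)"
    printf '</head><body><pre>\n'
    html_escape
    printf '</pre></body></html>\n'
}
```

A filter would then be invoked along the lines of
`some2txt file | text_to_html "file" iso-8859-1`, where some2txt stands for
whatever converter is being wrapped.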

 > It's true that some such issues can be handled to at least some extent
 > with a wrapper script around the command, but then you're adding the
 > overhead of forking several extra commands per file processed, which
 > is better avoided.

One can only agree with this. But the document types which would use an
external filter are either unusual or heavyweight (the rest can stay
in-process), so executing a few additional commands for these may not
prove to be a major issue.

 > We also don't want to encourage hacky handling of temporary files as
 > that's a route straight to security bugs via symlink attacks - an
 > obvious but bad approach to handling output to a file is a wrapper
 > script like this one:
 > 
 >     #!/bin/sh
 >     foo2txt "$1" /tmp/$$.txt
 >     cat /tmp/$$.txt
 >     rm /tmp/$$.txt

You can do this without the shell too :)
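
For the case where a wrapper script is used anyway, a less fragile variant
of the quoted script could look like the sketch below. It assumes the
mktemp utility is available (it is not part of POSIX, though most systems
ship it); foo2txt is the placeholder converter from the quoted script, and
the FILTER variable and the function name filter_to_stdout are invented for
the example.

```shell
#!/bin/sh
# Run a converter that can only write to a file, then copy its output to stdout.
#   $1 = input file; FILTER optionally overrides the converter command.
filter_to_stdout() {
    input=$1
    # mktemp -d creates a fresh directory with an unpredictable name that only
    # the current user can access, so nobody can plant a symlink at the
    # output path beforehand.
    tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/filter.XXXXXX") || return 1
    "${FILTER:-foo2txt}" "$input" "$tmpdir/out.txt"
    status=$?
    # Only emit output if the converter reported success.
    [ "$status" -eq 0 ] && cat "$tmpdir/out.txt"
    rm -rf "$tmpdir"
    return "$status"
}
```

The output file lives inside a private directory rather than directly in
/tmp, which is what removes the predictable /tmp/$$.txt name from the
picture.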

jf


