[Xapian-discuss] Omindex Filters

Olly Betts olly at survex.com
Wed Sep 17 05:40:18 BST 2008


On Mon, Sep 15, 2008 at 01:33:45PM +0100, James Aylett wrote:
> On Mon, Sep 15, 2008 at 09:59:16PM +0930, Frank J Bruzzaniti wrote:
> 
> > I was wondering if it would be a bad idea to have a way to incorporate 
> > plugins/filters in a way that would allow us to chop and change filters 
> > without having to recompile and edit the source.
> 
> We've discussed this in the past, and certainly I'm in favour of
> it.

Yes, it would be useful.

However, I think it's important to think about the right way to specify
such filters - an "executable or script which decodes the input file as
text to stdout" isn't all that general.  If we add the apparently
implicit assumption that the output is UTF-8, then of the formats
currently supported, that covers only these:

	application/msword
	application/vnd.ms-excel
	application/vnd.ms-powerpoint
	application/vnd.ms-works
	application/vnd.wordperfect
	image/vnd.djvu

I think you need to at least consider the character set and format of
the output (plain text and HTML are common), and possibly also filters
which can only produce output to a file, not stdout.  Meta-data is
another issue (look at the PDF handling for example).

It's true that some such issues can be handled to at least some extent
with a wrapper script around the command, but then you're adding the
overhead of forking several extra commands per file processed, which
is better avoided.
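
As a rough sketch of the kind of wrapper I mean, here is one which
re-encodes a converter's output as UTF-8.  The names are assumptions for
illustration only: foo2txt is a made-up converter which writes text to
stdout, and SOURCE_CHARSET is whatever encoding it happens to emit.

```shell
#!/bin/sh
# Hypothetical names: foo2txt stands in for any converter that writes
# text to stdout; SOURCE_CHARSET is the encoding that converter emits.
CONVERTER=${CONVERTER:-foo2txt}
SOURCE_CHARSET=${SOURCE_CHARSET:-ISO-8859-1}

# Run the converter on one file and re-encode its output as UTF-8.
# Besides the converter itself, each file indexed now costs a shell
# (to run this wrapper) and an iconv process.
filter_as_utf8() {
    "$CONVERTER" "$1" | iconv -f "$SOURCE_CHARSET" -t UTF-8
}
```

A real wrapper would end with a call such as filter_as_utf8 "$1".  It
works, but the extra processes are paid for on every single file.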

We also don't want to encourage hacky handling of temporary files as
that's a route straight to security bugs via symlink attacks - an
obvious but bad approach to handling output to a file is a wrapper
script like this one:

    #!/bin/sh
    foo2txt "$1" /tmp/$$.txt
    cat /tmp/$$.txt
    rm /tmp/$$.txt
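
The problem there is that /tmp/$$.txt is predictable (it's just the
process ID), so another local user can create a symlink with that name
in advance and have the converter write through it.  If a wrapper really
is needed, a less fragile sketch might look like the following.  This
assumes a mktemp which supports -d with no template (GNU and current BSD
versions do, but it isn't in POSIX), and foo2txt is still just a
placeholder name.

```shell
#!/bin/sh
# Hypothetical name: foo2txt stands in for a converter that can only
# write to a named output file, as in the example above.
CONVERTER=${CONVERTER:-foo2txt}

# Convert "$1" via a private temporary directory and print the result.
filter_to_stdout() {
    # mktemp -d creates a new directory with an unpredictable name and
    # mode 0700, so another user cannot plant a symlink inside it.
    tmpdir=$(mktemp -d) || return 1
    "$CONVERTER" "$1" "$tmpdir/out.txt" && cat "$tmpdir/out.txt"
    status=$?
    # Clean up, then report whether the conversion succeeded.
    rm -rf "$tmpdir"
    return "$status"
}
```

A real wrapper would end with filter_to_stdout "$1", and would probably
want a trap to clean up if it is interrupted.  It still forks several
extra commands per file, so it doesn't address the overhead.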

> Before going to a configuration file, I'd suggest building in a
> CLI option to set filters, as that will require some internal plumbing
> to get working anyway, and not having to use a configuration file is a
> good thing IMO.

I don't see this as a good thing to specify on the command line.  Nobody
is going to type these in by hand for every run, so this just means you
need a script to launch omindex with the right options.  And that's just
a config file with a nasty syntax (since shell metacharacters need
protecting).  Also, if we have an explicit configuration file, we can
look at its timestamp to check for changes to it.

And it really is time that omindex finally got a proper configuration
file.

Cheers,
    Olly


