[Xapian-discuss] Omindex Filters
Frank J Bruzzaniti
frank.bruzzaniti at gmail.com
Wed Sep 17 10:50:16 BST 2008
How about XML for the output, so we can incorporate any additional meta-data?
How about using tmpfs for processing? That could speed things up compared with
writing to disk all the time, and it could be wiped before or after batches.
Olly Betts wrote:
> On Mon, Sep 15, 2008 at 01:33:45PM +0100, James Aylett wrote:
>
>> On Mon, Sep 15, 2008 at 09:59:16PM +0930, Frank J Bruzzaniti wrote:
>>
>>
>>> I was wondering if it would be a bad idea to have a way to incorporate
>>> plugins/filters in a way that would allow us to chop and change filters
>>> without having to recompile and edit the source.
>>>
>> We've discussed this in the past, and certainly I'm in favour of
>> it.
>>
>
> Yes, it would be useful.
>
> However, I think it's important to think about the right way to specify
> such filters - an "executable or script which decodes the input file as
> text to stdout" isn't all that general. Adding the apparently implicit
> assumption that the output is UTF-8, then of the existing formats
> supported, that covers only these:
>
> application/msword
> application/vnd.ms-excel
> application/vnd.ms-powerpoint
> application/vnd.ms-works
> application/vnd.wordperfect
> image/vnd.djvu
>
> I think you need to at least consider the character set and format of
> the output (plain text and HTML are common), and possibly also filters
> which can only produce output to a file, not stdout. Meta-data is
> another issue (look at the PDF handling for example).
>
> It's true that some such issues can be handled to at least some extent
> with a wrapper script around the command, but then you're adding the
> overhead of forking several extra commands per file processed, which
> is better avoided.
>
> We also don't want to encourage hacky handling of temporary files as
> that's a route straight to security bugs via symlink attacks - an
> obvious but bad approach to handling output to a file is a wrapper
> script like this one:
>
> #!/bin/sh
> foo2txt "$1" /tmp/$$.txt
> cat /tmp/$$.txt
> rm /tmp/$$.txt
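
For contrast with the wrapper quoted above, here is a minimal sketch of handling a file-only converter without a predictable temporary path. It assumes mktemp is available on the system. The function name convert_to_stdout and the CONVERTER variable are made up for this illustration, and foo2txt is the same placeholder as in the quoted script. It still forks extra commands per file, so it does nothing about the overhead concern.

```shell
#!/bin/sh
# Hypothetical helper: run a converter that can only write to a file,
# then send the result to stdout. CONVERTER is an illustrative variable,
# defaulting to the foo2txt placeholder from the quoted script.
convert_to_stdout() {
    # mktemp creates a new file with an unpredictable name and fails
    # rather than reusing an existing path, unlike /tmp/$$.txt.
    tmp=$(mktemp "${TMPDIR:-/tmp}/filter.XXXXXX") || return 1
    # Only print the output if the converter itself succeeded.
    "${CONVERTER:-foo2txt}" "$1" "$tmp" && cat "$tmp"
    status=$?
    # Clean up whether or not the conversion worked.
    rm -f "$tmp"
    return "$status"
}
```

A wrapper would call it as convert_to_stdout "$1". Pointing TMPDIR at a tmpfs mount would keep the temporary file off disk, along the lines of the tmpfs suggestion at the top of this message.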
>
>
>> Before going to a configuration file, I'd suggest building in a
>> CLI option to set filters, as that will require some internal plumbing
>> to get working anyway, and not having to use a configuration file is a
>> good thing IMO.
>>
>
> I don't see this as a good thing to specify on the command line. Nobody
> is going to type these in by hand for every run, so this just means you
> need a script to launch omindex with the right options. And that's just
> a config file with a nasty syntax (since shell meta characters need
> protecting). Also, if we have an explicit configuration file, we can
> look at its timestamp to check for changes to it.
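
To illustrate the timestamp point, here is a rough sketch of how a change to a configuration file could be detected between runs. This is not how omindex works; the function name config_changed and the idea of a separate stamp file are assumptions made only for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the config file ($1) has been modified
# since the stamp file ($2) was last touched, or if no stamp exists yet.
config_changed() {
    # No stamp file means no previous run has been recorded.
    [ ! -e "$2" ] && return 0
    # find prints the config path only if it is newer than the stamp.
    [ -n "$(find "$1" -newer "$2" 2>/dev/null)" ]
}
```

An indexing script could call config_changed with its config and stamp paths before a run, and touch the stamp file once the run completes.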
>
> And it really is time that omindex finally got a proper configuration
> file.
>
> Cheers,
> Olly
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>