[Xapian-tickets] [Xapian] #569: Generate omindex docs and code relating to file types (was: omindex --help text for -F misleading)
Xapian
nobody at xapian.org
Fri Nov 6 00:38:36 GMT 2015
#569: Generate omindex docs and code relating to file types
--------------------+-----------------------------
Reporter: catkin | Owner: olly
Type: defect | Status: assigned
Priority: normal | Milestone: 1.4.x
Component: Omega | Version: 1.2.5
Severity: normal | Resolution:
Keywords: | Blocked By:
Blocking: | Operating System: All
--------------------+-----------------------------
Old description:
> From the omindex man page:
>
> {{{
> -F, --filter=TYPE:CMD
> process files with MIME Content-Type TYPE using command CMD, which
> should produce UTF-8 text on stdout e.g.
> -Fapplication/octet-stream:'strings -n8'
> }}}
>
> This could be read as saying that omindex examines file contents to
> determine their MIME type (that is how I understood it), but according to
> Olly's posting of 5 Oct 2011, subject "Re: [Xapian-discuss] Tika 0.8
> failure rates":
>
> By default, omindex currently uses a list of extension->MIME
> content-type mappings, and only consults the magic library for
> extensions it doesn't know. So any file with a .doc extension will be
> considered as application/msword (unless you run omindex with
> '--mime-type=doc:').
>
> A note about this could be added to the omindex man page and referenced
> from the -F and -M options descriptions.
New description:
We should try to generate all the docs and code relating to file types
from a common source to ensure they stay in step with one another.
----
''Original description:''
From the omindex man page:
{{{
-F, --filter=TYPE:CMD
process files with MIME Content-Type TYPE using command CMD, which
should produce UTF-8 text on stdout e.g.
-Fapplication/octet-stream:'strings -n8'
}}}
This could be read as saying that omindex examines file contents to
determine their MIME type (that is how I understood it), but according to
Olly's posting of 5 Oct 2011, subject "Re: [Xapian-discuss] Tika 0.8
failure rates":
By default, omindex currently uses a list of extension->MIME
content-type mappings, and only consults the magic library for
extensions it doesn't know. So any file with a .doc extension will be
considered as application/msword (unless you run omindex with
'--mime-type=doc:').
A note about this could be added to the omindex man page and referenced
from the -F and -M options descriptions.
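To make the lookup order in the quoted posting concrete, here is a minimal Python sketch. It is not omindex's actual code (omindex is written in C++): the table entries, the function names and the `sniff_contents` callable standing in for the magic library are all assumptions made for illustration.

```python
import os

# Illustrative subset of an extension -> MIME Content-Type table.
# The entries are assumptions, not omindex's real built-in list.
DEFAULT_MIME_MAP = {
    "doc": "application/msword",
    "txt": "text/plain",
    "html": "text/html",
}


def apply_mime_option(mime_map, spec):
    """Apply one EXT:TYPE argument in the style of --mime-type.

    An empty TYPE removes any existing mapping for EXT.
    """
    ext, sep, mime_type = spec.partition(":")
    if not sep:
        raise ValueError("expected EXT:TYPE, got %r" % spec)
    if mime_type:
        mime_map[ext] = mime_type
    else:
        mime_map.pop(ext, None)


def guess_mime_type(path, mime_map, sniff_contents):
    """Return the mapped type for a known extension, else sniff the file.

    sniff_contents is a hypothetical stand-in for the magic library and
    is only called when the extension has no mapping.
    """
    ext = os.path.splitext(path)[1].lstrip(".")
    if ext in mime_map:
        return mime_map[ext]
    return sniff_contents(path)
```

With this sketch, a `.doc` file is reported as application/msword without its contents being examined, and only after `apply_mime_option(mime_map, "doc:")` does the content sniffer get consulted for it.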
--
Comment (by olly):
I've amended the help text in
[1c687618afcbc8e7163d3b8f15f0887c7cec71cc/git] and current master says
(`--filter` has since gained support for character encodings other than
UTF-8 and for HTML output):
{{{
  -M, --mime-type=EXT:TYPE  assume any file with extension EXT has MIME
                            Content-Type TYPE, instead of using libmagic
                            (empty TYPE removes any existing mapping for
                            EXT)
  -F, --filter=M[,[T][,C]]:CMD
                            process files with MIME Content-Type M using
                            command CMD, which produces output (on stdout
                            or in a temporary file) with format T
                            (Content-Type or file extension; currently txt
                            (default) or html) in character encoding C
                            (default: UTF-8).
                            E.g. -Fapplication/octet-stream:'strings -n8'
                            or -Ftext/x-foo,,utf-16:'foo2utf16 %f %t'
}}}
I think that deals with the original report, so retitling.
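For readers unsure how the `M[,[T][,C]]:CMD` syntax in the help text above breaks down, the following Python sketch splits such an argument into its parts and fills in the documented defaults (txt and UTF-8). It only illustrates the syntax and is not how omindex parses the option; `FilterSpec` and `parse_filter_spec` are hypothetical names, and splitting at the first colon is an assumption based on MIME types not containing one.

```python
from typing import NamedTuple


class FilterSpec(NamedTuple):
    mime_type: str      # M: MIME Content-Type the filter handles
    output_format: str  # T: format of the command's output
    encoding: str       # C: character encoding of the output
    command: str        # CMD: command to run


def parse_filter_spec(spec):
    """Split an M[,[T][,C]]:CMD string, applying the documented defaults."""
    # Assumption: the first ':' ends the M[,[T][,C]] part, so CMD may
    # itself contain colons.
    head, sep, command = spec.partition(":")
    if not sep:
        raise ValueError("expected M[,[T][,C]]:CMD, got %r" % spec)
    parts = head.split(",")
    if len(parts) > 3 or not parts[0]:
        raise ValueError("bad type specification %r" % head)
    mime_type = parts[0]
    # Omitted or empty fields fall back to the defaults from the help text.
    output_format = parts[1] if len(parts) > 1 and parts[1] else "txt"
    encoding = parts[2] if len(parts) > 2 and parts[2] else "UTF-8"
    return FilterSpec(mime_type, output_format, encoding, command)
```

Applied to the two examples from the help text (with the shell quoting already removed), the first yields format txt and encoding UTF-8 by default, and the second yields format txt with encoding utf-16.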
--
Ticket URL: <http://trac.xapian.org/ticket/569#comment:13>
Xapian <http://xapian.org/>