[Xapian-tickets] [Xapian] #569: Generate omindex docs and code relating to file types (was: omindex --help text for -F misleading)

Xapian nobody at xapian.org
Fri Nov 6 00:38:36 GMT 2015


#569: Generate omindex docs and code relating to file types
--------------------+-----------------------------
 Reporter:  catkin  |             Owner:  olly
     Type:  defect  |            Status:  assigned
 Priority:  normal  |         Milestone:  1.4.x
Component:  Omega   |           Version:  1.2.5
 Severity:  normal  |        Resolution:
 Keywords:          |        Blocked By:
 Blocking:          |  Operating System:  All
--------------------+-----------------------------

Old description:

> From the omindex man page:
>
> {{{
> -F, --filter=TYPE:CMD
>     process files with MIME Content-Type TYPE using command CMD, which
>     should produce UTF-8 text on stdout e.g.
>     -Fapplication/octet-stream:'strings -n8'
> }}}
>
> This could be understood to mean that omindex examines files to determine
> their MIME type (I understood it that way) but from Olly's posting,
> subject "Re: [Xapian-discuss] Tika 0.8 failure rates", date 5oct11:
>
> By default, omindex currently uses a list of extension->MIME
> content-type mappings, and only consults the magic library for
> extensions it doesn't know.  So any file with a .doc extension will be
> considered as application/msword (unless you run omindex with
> '--mime-type=doc:').
>
> A note about this could be added to the omindex man page and referenced
> from the -F and -M options descriptions.

New description:

 We should try to generate all the docs and code relating to file types
 from a common source to ensure they stay in step with one another.

 ----
 ''Original description:''

 From the omindex man page:

 {{{
 -F, --filter=TYPE:CMD
     process files with MIME Content-Type TYPE using command CMD, which
     should produce UTF-8 text on stdout e.g.
     -Fapplication/octet-stream:'strings -n8'
 }}}

 This could be understood to mean that omindex examines files to determine
 their MIME type (I understood it that way) but from Olly's posting,
 subject "Re: [Xapian-discuss] Tika 0.8 failure rates", date 5oct11:

 By default, omindex currently uses a list of extension->MIME
 content-type mappings, and only consults the magic library for
 extensions it doesn't know.  So any file with a .doc extension will be
 considered as application/msword (unless you run omindex with
 '--mime-type=doc:').

 A note about this could be added to the omindex man page and referenced
 from the -F and -M options descriptions.
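
 The lookup order described in the quoted posting (extension table first,
 content sniffing only for unknown extensions) can be sketched as below.
 This is an illustration, not omindex's code: the table contents, the
 function names, and the `sniff` callback standing in for the magic library
 are all assumptions.

```python
from pathlib import Path
from typing import Callable, Dict, Optional

# Hypothetical two-entry table; omindex's real extension list is much longer.
DEFAULT_MIME_MAP: Dict[str, str] = {
    "doc": "application/msword",
    "txt": "text/plain",
}


def apply_mime_option(mime_map: Dict[str, str], spec: str) -> None:
    """Apply one EXT:TYPE spec; an empty TYPE removes the mapping for EXT."""
    ext, sep, mime_type = spec.partition(":")
    if not sep or not ext:
        raise ValueError(f"expected EXT:TYPE, got {spec!r}")
    if mime_type:
        mime_map[ext] = mime_type
    else:
        mime_map.pop(ext, None)


def guess_mime_type(
    path: str,
    mime_map: Dict[str, str],
    sniff: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the mapped type for the file's extension, else ask `sniff`.

    `sniff` is a stand-in for a content-based detector such as libmagic.
    """
    ext = Path(path).suffix.lstrip(".")
    if ext in mime_map:
        return mime_map[ext]  # the file's content is never examined
    return sniff(path)
```

 With the default table, `report.doc` is treated as application/msword
 whatever it contains; after `apply_mime_option(mime_map, "doc:")` the same
 file is passed to `sniff` instead.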

--

Comment (by olly):

 I've amended the help text in
 [1c687618afcbc8e7163d3b8f15f0887c7cec71cc/git] and current master says
 (`--filter` has since gained support for character encodings other than
 UTF-8 and for HTML output):

 {{{
   -M, --mime-type=EXT:TYPE  assume any file with extension EXT has MIME
                             Content-Type TYPE, instead of using libmagic
                             (empty TYPE removes any existing mapping for EXT)
   -F, --filter=M[,[T][,C]]:CMD
                             process files with MIME Content-Type M using
                             command CMD, which produces output (on stdout or
                             in a temporary file) with format T (Content-Type
                             or file extension; currently txt (default) or
                             html) in character encoding C (default: UTF-8).
                             E.g. -Fapplication/octet-stream:'strings -n8'
                             or -Ftext/x-foo,,utf-16:'foo2utf16 %f %t'
 }}}

 I think that deals with the original report, so retitling.
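
 To make the `M[,[T][,C]]:CMD` syntax concrete, here is a minimal sketch of
 how such an argument could be split into its parts, with the defaults taken
 from the help text above (format txt, encoding UTF-8). The `FilterSpec` type
 and `parse_filter_spec` function are hypothetical names for illustration,
 and omindex's own parsing may differ in detail.

```python
from typing import NamedTuple


class FilterSpec(NamedTuple):
    mime_type: str      # M: MIME Content-Type the filter applies to
    output_format: str  # T: format of the command's output
    encoding: str       # C: character encoding of the command's output
    command: str        # CMD: command line to run


def parse_filter_spec(spec: str) -> FilterSpec:
    """Split an M[,[T][,C]]:CMD string, filling in the documented defaults."""
    # Everything after the first colon is the command.
    head, sep, command = spec.partition(":")
    if not sep or not command:
        raise ValueError(f"expected M[,[T][,C]]:CMD, got {spec!r}")
    parts = head.split(",")
    if len(parts) > 3 or not parts[0]:
        raise ValueError(f"bad type specification in {spec!r}")
    # Pad so that "M", "M,T" and "M,,C" all unpack into three fields.
    parts += [""] * (3 - len(parts))
    mime_type, output_format, encoding = parts
    return FilterSpec(mime_type, output_format or "txt",
                      encoding or "UTF-8", command)
```

 Applied to the two examples from the help text (with the shell quoting
 already removed), `parse_filter_spec("application/octet-stream:strings -n8")`
 yields format txt and encoding UTF-8, while
 `parse_filter_spec("text/x-foo,,utf-16:foo2utf16 %f %t")` keeps the default
 format and sets the encoding to utf-16.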

--
Ticket URL: <http://trac.xapian.org/ticket/569#comment:13>
Xapian <http://xapian.org/>