[Xapian-discuss] index everything? (no extensions/no mime-types)
Olly Betts
olly at survex.com
Sun Feb 20 14:34:11 GMT 2011
On Sat, Feb 19, 2011 at 01:21:49PM -0600, Jeremy C. Reed wrote:
> I have around 550,000 files (4.7GB) that I need to index. It is a huge
> mix of file types. I don't need access to this via the web. I just use it
> locally for research. For now I do a grep and wait several minutes.
>
> omindex complains of
>
> Unknown extension: .... - skipping
>
> as I have many thousands of files that don't have extensions (no
> period).
>
> Is there any way to use omindex to index regardless of the extensions?
> Maybe just treat them as plain text or run strings on them?
You can set a mapping for files with no extension, e.g. to treat them as plain text:
-M:text/plain
I think that should work with any Omega version you're likely to be
using.
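For example, a full run might look something like the sketch below. The database and document paths are made up, and I'm assuming the usual --db and --url options (the URL prefix is only a placeholder here, since you aren't serving results over the web). The command is printed rather than executed, so you can check it first:

```shell
# Hypothetical locations, adjust to your own setup.
DB=/var/tmp/research-db
DOCS=/home/me/research

# -M:text/plain maps "no extension" to text/plain.
CMD="omindex --db $DB --url / -M:text/plain $DOCS"

# Dry run: show the command. To actually index, use: eval "$CMD"
echo "$CMD"
```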
New in 1.2.4, Omega can use libmagic to detect the content-type of files
for which there's no extension mapping. This is enabled if the
libmagic development files are found at build time, so install those if
building from source, or if using a package, politely ask your packager to
ensure libmagic is installed when building (the Debian and Ubuntu packages
have this enabled).
1.2.4 also adds a way to specify filters on the command line, so you can
set a mapping for no extension with:
-M:application/octet-stream
and then tell omindex to run "strings -n8" on such files using:
--filter=application/octet-stream:'strings -n8'
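Put together, the two options might be combined roughly as follows. Again this is only a sketch: the paths are hypothetical, --db and --url are assumed as the usual options, and the command is printed rather than run:

```shell
# Hypothetical locations, adjust to your own setup.
DB=/var/tmp/research-db
DOCS=/home/me/research

# Map "no extension" to application/octet-stream, then have omindex
# extract text from such files with "strings -n8" (needs Omega 1.2.4+).
CMD="omindex --db $DB --url / -M:application/octet-stream --filter=application/octet-stream:'strings -n8' $DOCS"

# Dry run: show the command. To actually index, use: eval "$CMD"
echo "$CMD"
```

The single quotes around 'strings -n8' keep the filter command and its option together as one argument when the line is eventually run.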
There currently isn't a way to set a content-type regardless of
extension, and I'm not sure I can see a good use case for one.
You also can't clear all the mappings except by passing -Mhtml: etc
for every extension which has a mapping by default, which is very
cumbersome. Removing all the mappings would mean libmagic gets used on
all files, so it might be useful to have a simple way to achieve this.
Cheers,
Olly