[Xapian-discuss] index everything? (no extensions/no mime-types)

Olly Betts olly at survex.com
Sun Feb 20 14:34:11 GMT 2011


On Sat, Feb 19, 2011 at 01:21:49PM -0600, Jeremy C. Reed wrote:
> I have around 550,000 files (4.7GB) that I need to index. It is a huge 
> mix of file types. I don't need access to this via the web; I just use it 
> for local research. For now I do a grep and wait several minutes.
> 
> omindex complains of
> 
> 	Unknown extension: .... - skipping
> 
> As I have many thousands of files that don't have extensions. (No 
> Period.)
> 
> Any way to use omindex to index regardless of the extensions? Maybe just 
> treat them as plain text or run strings on them?

You can set a mapping for files with no extension, e.g. to treat them as
plain text:

    -M:text/plain

I think that should work with any Omega version you're likely to be
using.
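
As a rough sketch, a complete invocation using that mapping might look
like the following.  The database and directory paths are hypothetical,
and "--url /" is only a placeholder prefix since the index is searched
locally; the script just prints the command so it can be checked first.

```shell
# Hypothetical paths: adjust both to your own setup.
db="$HOME/research-db"        # Xapian database that omindex creates or updates
docs="$HOME/research-files"   # top of the directory tree to index

# Map the empty extension to text/plain, as described above.
map_opt="-M:text/plain"

# --url sets the URL prefix recorded for each document; "/" is a placeholder
# here because the index is only used locally.
set -- omindex --db "$db" --url / "$map_opt" "$docs"

# Dry run: show the command, one argument per line.
preview=$(printf '%s\n' "$@")
printf '%s\n' "$preview"
```

Once the printed arguments look right, the same command can be run from
that shell with "$@".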

New in 1.2.4, Omega can use libmagic to detect the content-type of files
for which there's no extension mapping.  This is enabled if the libmagic
development files are found at build time, so install those if building
from source, or if using a package, politely ask your packager to ensure
libmagic is installed when building (the Debian and Ubuntu packages have
this enabled).

1.2.4 also adds a way to specify filters on the command line, so you can
set a mapping for no extension with:

    -M:application/octet-stream

and then tell omindex to run "strings -n8" on such files using:

    --filter=application/octet-stream:'strings -n8'
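
Put together, the two options might be combined along these lines (1.2.4
or later).  This is only a sketch: the paths are hypothetical, "--url /"
is a placeholder prefix, and the script prints the command rather than
running it.  Note that the filter command has to stay a single shell
argument, which is what the quotes in the option above are for.

```shell
# Hypothetical paths: adjust both to your own setup.
db="$HOME/research-db"
docs="$HOME/research-files"

# Files with no extension are treated as application/octet-stream ...
map_opt="-M:application/octet-stream"
# ... and that content-type is run through "strings -n8".  The double quotes
# keep the space inside a single argument.
filter_opt="--filter=application/octet-stream:strings -n8"

set -- omindex --db "$db" --url / "$map_opt" "$filter_opt" "$docs"

# Dry run: one argument per line, so the embedded space is easy to spot.
preview=$(printf '%s\n' "$@")
printf '%s\n' "$preview"
```

Run it from the same shell with "$@" once the output looks right.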

There isn't a way to set a content-type regardless of extension
currently.  Not sure that I can see a good use case for that.

You also can't clear all the mappings except by passing -Mhtml: etc.
for every extension which has a mapping by default, which is very
cumbersome.  Removing all the mappings means libmagic would be used on
all files, so it might be useful to have a simple way to achieve this.
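
To give an idea of how cumbersome, here is a sketch that generates the
removal options in a loop.  It assumes the extension goes before the
colon (as in the -M:text/plain form above) and that an empty type after
the colon removes the mapping.  The extension list is a hypothetical
subset; the real default list is much longer and depends on the Omega
version.

```shell
# Hypothetical subset of the extensions omindex maps by default.
exts="txt html htm pdf doc"

# Build one "-MEXT:" option per extension (empty type = remove the mapping).
unmap_opts=""
for ext in $exts; do
    unmap_opts="$unmap_opts -M${ext}:"
done
unmap_opts=${unmap_opts# }    # drop the leading space

# Show the options that would need to be added to the omindex command line.
printf '%s\n' "$unmap_opts"
```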

Cheers,
    Olly


