[Xapian-discuss] index everything? (no extensions/no mime-types)

Jeremy C. Reed reed at reedmedia.net
Wed Mar 2 20:02:33 GMT 2011


On Sun, 20 Feb 2011, Olly Betts wrote:

> On Sat, Feb 19, 2011 at 01:21:49PM -0600, Jeremy C. Reed wrote:
> > I have around 550,000 files (4.7GB) that I need to index. It is a huge 
> > mix of file types. I don't need access to this via web. I just use it for 
> > research locally. For now I do a grep and wait several minutes.
> > 
> > omindex complains of
> > 
> > 	Unknown extension: .... - skipping
> > 
> > because I have many thousands of files that don't have extensions (no 
> > period in the name).
> > 
> > Any way to use omindex to index regardless of the extensions? Maybe just 
> > treat them as plain text or run strings on them?
> 
> You can set a mapping for no extension - e.g. to treat as plain text:
> 
>     -M:text/plain
> 
> I think that should work with any Omega version you're likely to be
> using.

Thanks.

> New in 1.2.4, Omega can use libmagic to detect the content-type of files
> for which there's no extension mapping.  This is enabled if the
> libmagic development files are found, so install those if building from
> source, or if using a package, politely ask your packager to ensure
> libmagic is installed when building (the Debian and Ubuntu packages have
> this enabled).

Okay, I am using this now. Thanks.

> 1.2.4 also adds a way to specify filters on the command line, so you can
> set a mapping for no extension with:
> 
>     -M:application/octet-stream
> 
> and then tell omindex to run "strings -n8" on such files using:
> 
>     --filter=application/octet-stream:'strings -n8'
> 
> There isn't a way to set a content-type regardless of extension
> currently.  Not sure that I can see a good use case for that.

I have maybe over a hundred different unknown MIME types (troff, x-tex,
pascal, fortran, x-c, x-c++, and many more), and I am sure the list will
change.
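
To get a rough picture of which content types sit behind the
extensionless files, something like the following might help. It is only
a sketch: list_no_ext and survey_types are made-up helper names, it
assumes the file(1) utility (the command-line front end to libmagic) is
installed, and what file(1) reports may not match exactly what omindex's
libmagic support decides.

```shell
# Hypothetical helpers, not part of Omega.

# List regular files whose names contain no period at all.
# Note: dotfiles such as ".profile" contain a period, so they are not listed.
list_no_ext() {
    find "$1" -type f ! -name '*.*'
}

# Tally the MIME types reported by file(1) for those files, most common first.
# Assumes file(1) supports --mime-type and -b (brief output).
survey_types() {
    find "$1" -type f ! -name '*.*' -exec file --mime-type -b {} + |
        sort | uniq -c | sort -rn
}
```

Running survey_types on the top-level directory should then print one
line per detected type with a count in front.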

If the type is unknown, I want omindex to fall back to treating the file
as text, or at least to run strings on it.

I need everything that might have text in it indexed (so I can skip
images, videos, sound files).
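
For reference, the two options quoted above could be combined into one
invocation roughly as sketched below. The database and source paths are
placeholders, omindex_cmd is a made-up helper that only prints the
command line, and "--url /" is just a dummy base URL since nothing is
served over the web. As far as I can tell this only covers files with no
extension at all; files with an extension omindex does not know still
depend on libmagic.

```shell
# Hypothetical locations; adjust to taste.
DB=/var/lib/omega/data/research
SRC=/home/reed/archive

# Print the omindex command that combines the two options quoted above.
# -M: maps "no extension" to application/octet-stream, and --filter
# (Omega 1.2.4 or later) runs "strings -n8" on files of that type.
omindex_cmd() {
    printf '%s ' omindex --db "$1" --url / \
        -M:application/octet-stream \
        "--filter=application/octet-stream:'strings -n8'" \
        "$2"
    echo
}

# Dry run: show the command line, then paste it into the shell to run it.
omindex_cmd "$DB" "$SRC"
```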

> You also can't clear all the mappings except by passing -Mtext/html:
> etc for every content-type which has a mapping by default, which is
> very cumbersome.  Removing all the mappings means libmagic would be used
> on all files, so it might be useful to have a simple way to achieve this.



