[Xapian-devel] omindex patch

Olly Betts olly at survex.com
Sat Sep 2 04:57:27 BST 2006


On Sun, Aug 20, 2006 at 08:33:56PM +0200, Reini Urban wrote:
> The patch is not yet complete.

I've had a quick look through - I think there are some useful things in
here.  Let me know when you've finished tweaking it.

> It needs autoreconf to update configure and the Makefiles.
> Note that unrar is not patent infected, only rar, the compressor.
> I've put some AC_PATH_PROG checks into configure for all helpers.

This assumes that the filters installed at configure time are the same
as those installed at run time, which isn't necessarily the case (for
binary packaged versions, it's probably rarely true).

I'd prefer to just run the filter anyway and check if it fails.  I've
just added some code to remove the ext->mime-type mapping when the
filter fails because it couldn't be found, so we now effectively lazily
probe the filters we want to use at run-time.
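
To illustrate the idea, here is a rough sketch of that run-time probing. It is not the code I committed: FilterResult, FilterRunner, try_filter and run_with_shell are names made up for this example. It also assumes a POSIX shell, which reports exit status 127 when the command can't be found.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <map>
#include <string>
#include <sys/wait.h>

// Hypothetical outcome of trying to run an external filter.
enum class FilterResult { Ok, Failed, NotFound };

// Hypothetical runner type: takes a command line, reports how it went.
using FilterRunner = std::function<FilterResult(const std::string&)>;

// One possible runner.  Assumption: std::system() goes through a POSIX
// shell, which exits with status 127 when the command is not found.
FilterResult run_with_shell(const std::string& command)
{
    int status = std::system(command.c_str());
    if (status == -1 || !WIFEXITED(status)) return FilterResult::Failed;
    int code = WEXITSTATUS(status);
    if (code == 0) return FilterResult::Ok;
    return code == 127 ? FilterResult::NotFound : FilterResult::Failed;
}

// Try the filter for files with extension `ext`.  If the filter program
// turns out to be missing, drop the ext->mime-type mapping so later files
// with that extension are skipped without running anything.
bool try_filter(std::map<std::string, std::string>& mime_map,
                const std::string& ext,
                const std::string& command,
                const FilterRunner& run)
{
    auto it = mime_map.find(ext);
    if (it == mime_map.end()) return false;  // unknown or already disabled
    FilterResult result = run(command);
    if (result == FilterResult::NotFound) {
        mime_map.erase(it);                  // lazily learned: no such filter
        return false;
    }
    return result == FilterResult::Ok;
}
```

An ordinary failure (say, a corrupt input file) leaves the mapping alone, so only a missing filter disables an extension. Nothing needs to be known at configure time.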

> +AM_LDFLAGS = -no-undefined

Sadly adding this unconditionally causes problems on some platforms (I
forget which off the top of my head).  Do you need it for cygwin?

> +#define SAMPLE_WORDS  500

This is actually the number of *CHARACTERS*, not words.

> +#ifdef HAVE_TEXTCAT
> +    char * lang;
> +    lang = textcat_Classify( textcat, sample.c_str(), sample.length()+1 );
> +    language = string(lang);
> +    if ((language != _TEXTCAT_RESULT_UNKOWN) // unknown language
> +	&& (language != _TEXTCAT_RESULT_SHORT)) // too little information
> +    {
> +	if (language[0] == '[') {
> +	    int pos = language.find(']',0);
> +	    language = language.substr(1,pos-1);
> +	}
> +	record += "\nlanguage=" + language;
> +	if (language != curr_lang)  {
> +	    cout << "new language " << curr_lang << " => " << language << " ";
> +	    stemmer = Xapian::Stem(language);
> +	    curr_lang = language;
> +	}
> +    }
> +#endif

If each document is stemmed in a potentially different language, how do
you decide which stemmer to use at query time?

Also, should documents which are categorised as "unknown" or "too short
to determine" really just get the last used language?  I can see that's
sometimes a good choice, but in other cases it can be very arbitrary.
It also means that such documents can get an entirely different language
in an update (because the previously processed document could be a
completely different one if only a few documents have changed).
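
One way to avoid that order dependence is to fall back to a fixed, configured default rather than whatever the previous document used. A minimal sketch, with some assumptions: first_language and stem_language_for are hypothetical helpers, and the "UNKNOWN" and "SHORT" literals are my assumption about what the two libtextcat result macros expand to (check textcat.h).

```cpp
#include <cassert>
#include <string>

// Assumed values of libtextcat's "no answer" results; check textcat.h.
const std::string RESULT_UNKNOWN = "UNKNOWN";
const std::string RESULT_SHORT = "SHORT";

// Extract the first candidate from a result such as "[english][german]".
// Returns an empty string when the classifier gave no usable answer,
// including a malformed result with no closing ']'.
std::string first_language(const std::string& result)
{
    if (result.empty() || result == RESULT_UNKNOWN || result == RESULT_SHORT)
        return std::string();
    if (result[0] != '[') return result;
    std::string::size_type close = result.find(']');
    if (close == std::string::npos) return std::string();
    return result.substr(1, close - 1);
}

// Pick the stemming language for one document.  The fallback is a fixed
// default, so the answer depends only on this document, not on which
// document happened to be indexed before it.
std::string stem_language_for(const std::string& result,
                              const std::string& default_lang)
{
    std::string lang = first_language(result);
    return lang.empty() ? default_lang : lang;
}
```

The returned name would still need checking against the languages Xapian::Stem supports before constructing a stemmer from it. This also does nothing to answer the query-time question above.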

Cheers,
    Olly


