[Xapian-discuss] Multilingual issues with Xapian

Thu Oct 11 01:09:10 BST 2007

I am trying to support searching multilingual text with Xapian. I indexed a document encoded in UTF-8, and it was indexed correctly.
The document contains a Hebrew word: רוצה

When I try to query for it, I run into a problem. See the debug output below:
	#: query: רוצה
	#: Enquire: Xapian::Enquire(Database(), Xapian::Query(Zרוצה:(pos=1)))
	#: MSET: Xapian::MSet(Xapian::MSet::Internal(firstitem=0, matches_lower_bound=0, matches_estimated=0, matches_upper_bound=0, max_possible=74.602774741557425386, max_attained=0))

At query time, the stemming strategy is STEM_SOME. Switching to STEM_NONE gives the desired results (note that the Z prefix is no longer added to the word):
	#: Enquire: Xapian::Enquire(Database(), Xapian::Query(רוצה:(pos=1)))

What happens is that the stemmer is applied to the Hebrew word even though it is not an English word, so the query looks for a Z-prefixed term that was never indexed.
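
The mismatch can be reproduced with a toy model of the two term sets. This is plain Python, not the Xapian API: index_terms, query_terms and identity_stem are hypothetical helpers, and the model ignores the cases where STEM_SOME leaves a word unstemmed. It only assumes what the debug output above suggests, namely that the stemmer returns the Hebrew word unchanged and that a Z prefix is added at query time.

```python
def identity_stem(word):
    # Assumption: the English stemmer returns a Hebrew word unchanged,
    # which is what the Zרוצה term in the debug output suggests.
    return word


def index_terms(words, stemmer=None):
    """Terms stored for a document: unstemmed forms, plus Z-prefixed
    stems only when a stemmer was set at index time."""
    terms = {w.lower() for w in words}
    if stemmer is not None:
        terms |= {"Z" + stemmer(w.lower()) for w in words}
    return terms


def query_terms(words, stemmer=None):
    """Terms looked up for a query: Z-prefixed stems when stemming is
    enabled, raw words when it is disabled."""
    if stemmer is None:
        return {w.lower() for w in words}
    return {"Z" + stemmer(w.lower()) for w in words}


doc = index_terms(["רוצה"])                          # indexed without a stemmer
stemmed_query = query_terms(["רוצה"], identity_stem)  # models STEM_SOME
plain_query = query_terms(["רוצה"])                   # models STEM_NONE

print(doc & stemmed_query)  # no overlap: no Z-term was ever indexed
print(doc & plain_query)    # the unstemmed term matches
```

In this model the only way to get a match is for both sides to agree on whether the Z-prefixed form exists, which is the situation described above.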

During indexing, no stemmer was set for this document, since the text is not English.
While indexing, I use the stemmer that matches each document's language, so English documents use "english", Russian ones use "russian", and so on. Documents in a language without an existing stemmer are indexed without stemming.

Naturally, if documents are stemmed at index time with the stemmer for their language, the same stemming should be applied to words at search time: if a user searches for a Russian word, the Russian stemmer should be applied to it, not the English one.
However, while detecting the language of a document is relatively easy, detecting the language of a single word is much harder and often impossible.
It is also possible that someone will mix and match words from multiple languages in the same search phrase, for example 'Nikon למכירה' (which means 'Nikon for sale').
Of course this can happen in a document as well.
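
A word's script, unlike its language, can be read directly from its characters. The sketch below is only a rough heuristic built on Python's standard unicodedata module; script_of and group_by_script are hypothetical names. It narrows the candidates (Hebrew script versus Latin versus Cyrillic) but cannot tell English from French, or Russian from Ukrainian.

```python
import unicodedata


def script_of(word):
    """Guess a word's script from the Unicode name of its first letter.

    Returns the first token of the character name, e.g. 'LATIN',
    'HEBREW' or 'CYRILLIC'; returns None if the word has no letters.
    """
    for ch in word:
        if ch.isalpha():
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return None


def group_by_script(query):
    """Split a query on whitespace and pair each word with its script."""
    return [(word, script_of(word)) for word in query.split()]


print(group_by_script("Nikon למכירה"))
```

For the mixed query above, this pairs 'Nikon' with LATIN and 'למכירה' with HEBREW, which is enough to avoid applying an English stemmer to a Hebrew word, though not enough to pick between languages that share a script.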

While Hebrew currently has no stemmer, the same problem arises between Russian and English, for example.

So here is a thought...

What if, instead of stemming every word in a document even when it has no real stemmed form, the stemmer (during indexing) stemmed only the words it knows to have a stemmed form?
In addition, it would be possible to specify a prioritized chain of languages for stemming, so that only if a word has no stemmed form in the first language would the second language be tried, and so on (this mechanism would allow stemming and indexing of mixed-language text).

So in the following 2 examples:

Example 1:
	Document: [Imaginary word abracadabra]
Would be indexed as:  
	Index: [Zimagin Zword imaginary word abracadabra]
and NOT as 
	Index: [Zimagin Zword Zabracadabra imaginary word abracadabra]

Example 2:
	Document: [רוצה Nikon]
Would be indexed as:  
	Index: [Znikon Zרוצ nikon רוצה]
and NOT as 
	Index: [Znikon Zרוצה nikon רוצה]

Even without multi-language stemming, the problem would be solved if the stemmer simply did not try to stem words for which it has no stemmed form: a word such as רוצה, which it does not recognize, would then be both indexed AND parsed/searched in its original, unstemmed form.
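
To make the proposal concrete, here is a minimal sketch of the chain in plain Python. Nothing in it is Xapian's API: each "stemmer" is a hypothetical lookup table, so "has a stemmed form" just means "is in the table", and the Hebrew entry is invented purely to reproduce Example 2. Real stemmers are rule-based and do not offer such a check, which is the part that would need new work.

```python
# Hypothetical toy "stemmers": each one is a lookup table of the words it
# knows, so "has a stemmed form" simply means "is in the table".
ENGLISH = {"imaginary": "imagin", "word": "word", "nikon": "nikon"}
HEBREW = {"רוצה": "רוצ"}  # invented entry, used only to mirror Example 2


def stem_with_chain(word, chain):
    """Return the stem from the first stemmer in the chain that knows
    the word, or None if no stemmer recognises it."""
    for stemmer in chain:
        stem = stemmer.get(word)
        if stem is not None:
            return stem
    return None


def index_terms(text, chain):
    """Build the term list: Z-prefixed stems for recognised words,
    followed by every word in its unstemmed, lowercased form."""
    words = [w.lower() for w in text.split()]
    stems = []
    for w in words:
        stem = stem_with_chain(w, chain)
        if stem is not None:
            stems.append("Z" + stem)
    return stems + words


print(index_terms("Imaginary word abracadabra", [ENGLISH]))
print(index_terms("Nikon רוצה", [ENGLISH, HEBREW]))
print(index_terms("Nikon רוצה", [ENGLISH]))
```

The first two calls produce the term lists of Example 1 and Example 2. The third shows the single-language case: with only the English table in the chain, רוצה gets no Z-term at all, so a query-side parser following the same rule would search for the unstemmed form.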

Your thoughts and feedback are much appreciated.

Ron