[Xapian-tickets] [Xapian] #741: "Empty termnames aren't allowed" by indexing text in Arabic

Xapian nobody at xapian.org
Thu Dec 15 05:33:15 GMT 2016


#741: "Empty termnames aren't allowed" by indexing text in Arabic
-------------------------+-----------------------------
 Reporter:  Kelson       |             Owner:  olly
     Type:  defect       |            Status:  assigned
 Priority:  normal       |         Milestone:  1.4.2
Component:  Library API  |           Version:  1.4.1
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  Linux
-------------------------+-----------------------------

Comment (by olly):

 Thanks for the insights - it's not a problem if a lone tatweel is ignored
 then.

 Unicode classes U+0640 as "Letter, Modifier" (Lm) and the tokeniser treats
 all subclasses of "Letter" the same way.  It could reject words that
 consist entirely of modifier letters, but it looks like
 [http://www.unicode.org/reports/tr29/tr29-29.html UAX#29 (Unicode Text
 Segmentation)] also treats Lm the same as other Letter subcategories, and
 a quick experiment with ICU on `"x ـ x"` seems to confirm this.  It seems
 rash to deviate from that without knowing a lot more than I do about the
 details of the various Lm characters across different scripts.
 (Also it would mean that "cute" stuff like ᴾᴼᴿᵀᴱᴿ wouldn't be indexed!)
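
 The category claim above can be checked with Python's standard
 `unicodedata` module.  This is only a sketch, not Xapian's tokeniser: the
 helper name `is_all_modifier_letters` is made up for illustration, and the
 result depends on the Unicode data bundled with the Python build.

```python
import unicodedata

TATWEEL = "\u0640"  # ARABIC TATWEEL
# The "cute" example from the comment, written as escapes:
# MODIFIER LETTER CAPITAL P, O, R, T, E, R
PORTER = "\u1d3e\u1d3c\u1d3f\u1d40\u1d31\u1d3f"


def is_all_modifier_letters(word: str) -> bool:
    """Return True if word is non-empty and every character is category Lm.

    Hypothetical helper: this is the check a tokeniser would need if it
    wanted to reject words made up entirely of modifier letters.
    """
    return bool(word) and all(unicodedata.category(ch) == "Lm" for ch in word)


if __name__ == "__main__":
    print(unicodedata.category(TATWEEL))      # general category of U+0640
    print(is_all_modifier_letters(TATWEEL))   # a lone tatweel
    print(is_all_modifier_letters(PORTER))    # would also be rejected
    print(is_all_modifier_letters("x"))       # ordinary letter (Ll)
```

 Rejecting every all-Lm word would drop both the lone tatweel and the
 superscript word, which is the trade-off described above.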

 To handle this outside the stemmer I think it's best to just quietly
 ignore empty stems.
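
 As a rough illustration of ignoring empty stems on the caller's side, here
 is a sketch in Python.  It is not Xapian code: `toy_stem` is a
 hypothetical stand-in for a stemmer that can return an empty string, and
 `stemmed_terms` is an invented helper name.  The `Z` prefix follows
 Xapian's convention for stemmed terms.

```python
from typing import Callable, Iterable, List

TATWEEL = "\u0640"  # ARABIC TATWEEL


def toy_stem(word: str) -> str:
    """Hypothetical stemmer stand-in: it only strips tatweel characters."""
    return word.replace(TATWEEL, "")


def stemmed_terms(
    words: Iterable[str],
    stem: Callable[[str], str],
    prefix: str = "Z",
) -> List[str]:
    """Build prefixed stemmed terms, quietly skipping empty stems."""
    terms = []
    for word in words:
        stemmed = stem(word)
        if not stemmed:
            # An empty stem would produce a bare-prefix or empty term name,
            # so drop it instead of raising an error.
            continue
        terms.append(prefix + stemmed)
    return terms


if __name__ == "__main__":
    # Mirrors the "x <tatweel> x" experiment: the lone tatweel is dropped.
    print(stemmed_terms(["x", TATWEEL, "x"], toy_stem))
```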

 Looking at the Arabic algorithm, I notice it seems overly aggressive in
 its removal of non-letters in general, unlike the other snowball stemmers
 which generally leave non-words alone (your example of `.` stems to `.`
 with all the other language stemmers I tried).  While pure punctuation
 strings are perhaps a bit esoteric, leaving non-words alone generally
 seems a sensible approach - one problematic case with the current Arabic
 stemming algorithm is that real numbers lose their decimal point, e.g.
 `20.16` -> `2016`.
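
 One way to picture "leaving non-words alone" is a guard in front of the
 stemmer, sketched below in Python.  Both names are hypothetical:
 `toy_aggressive_stem` merely imitates the symptom described above by
 dropping every non-alphanumeric character (it is not the Snowball Arabic
 algorithm), and `guarded_stem` is an assumed wrapper, not an existing
 Xapian or Snowball function.

```python
import unicodedata
from typing import Callable


def toy_aggressive_stem(word: str) -> str:
    """Toy stand-in that strips everything except letters and digits."""
    return "".join(ch for ch in word if ch.isalnum())


def guarded_stem(word: str, stem: Callable[[str], str]) -> str:
    """Only stem tokens that contain at least one letter (category L*)."""
    has_letter = any(unicodedata.category(ch).startswith("L") for ch in word)
    if not has_letter:
        # Numbers and pure punctuation pass through unchanged.
        return word
    return stem(word)


if __name__ == "__main__":
    print(toy_aggressive_stem("20.16"))                # decimal point is lost
    print(guarded_stem("20.16", toy_aggressive_stem))  # left alone
    print(guarded_stem(".", toy_aggressive_stem))      # left alone
```

 A lone tatweel is still category Lm, so it would reach the stemmer under
 this guard; the empty-stem case discussed earlier is a separate concern.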

--
Ticket URL: <https://trac.xapian.org/ticket/741#comment:7>
Xapian <https://xapian.org/>