[Xapian-tickets] [Xapian] #741: "Empty termnames aren't allowed" by indexing text in Arabic
Xapian
nobody at xapian.org
Thu Dec 15 05:33:15 GMT 2016
#741: "Empty termnames aren't allowed" by indexing text in Arabic
-------------------------+-----------------------------
 Reporter:  Kelson       |             Owner:  olly
     Type:  defect       |            Status:  assigned
 Priority:  normal       |         Milestone:  1.4.2
Component:  Library API  |           Version:  1.4.1
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  Linux
-------------------------+-----------------------------
Comment (by olly):
Thanks for the insights - it's not a problem if a lone tatweel is ignored
then.

Unicode classifies U+0640 (ARABIC TATWEEL) as "Letter, Modifier" (Lm), and
the tokeniser treats all subcategories of "Letter" the same way. It could
reject words that are composed entirely of modifiers, but it looks like
[http://www.unicode.org/reports/tr29/tr29-29.html UAX#29 (Unicode Text
Segmentation)] also treats Lm the same as other Letter subcategories, and
a quick experiment with ICU on `"x ـ x"` seems to confirm this. It seems
rash to deviate from that without knowing a lot more than I do about the
details of the various Lm characters in all the different scripts.
(Also it would mean that "cute" stuff like ᴾᴼᴿᵀᴱᴿ wouldn't be indexed!)
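
For anyone wanting to check the categories themselves, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `is_all_modifier_letters` is made up for illustration; it is not part of Xapian, and it only shows what a "reject words made entirely of modifiers" rule would catch.

```python
import unicodedata


def is_all_modifier_letters(word):
    """Return True if word is non-empty and every character is category Lm.

    Hypothetical helper for illustration only, not a Xapian function.
    """
    return bool(word) and all(unicodedata.category(ch) == "Lm" for ch in word)


# General category of U+0640 ARABIC TATWEEL.
print(unicodedata.category("\u0640"))

# A lone tatweel would be rejected by such a rule...
print(is_all_modifier_letters("\u0640"))
# ...but so would a word written in superscript-style modifier letters.
print(is_all_modifier_letters("ᴾᴼᴿᵀᴱᴿ"))
# Ordinary letters are unaffected.
print(is_all_modifier_letters("x"))
```

The third call is the reason such a rule looks too blunt: it cannot tell a stray tatweel from a deliberately styled word.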
To handle this outside the stemmer, I think it's best to just quietly
ignore empty stems.
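
A rough sketch of what "quietly ignore empty stems" could look like, in Python for brevity. This is not the actual Xapian code: `index_terms` and `toy_stem` are hypothetical names, and `toy_stem` is only a stand-in that strips tatweel so that a lone tatweel stems to the empty string.

```python
def index_terms(tokens, stem):
    """Yield (token, stemmed term) pairs, silently skipping empty stems.

    `stem` is any callable mapping a word to its stem (an assumption here).
    """
    for token in tokens:
        stemmed = stem(token)
        if not stemmed:
            # The stemmer removed everything, so there is no stemmed
            # term to add; skip it instead of raising an error.
            continue
        yield token, stemmed


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: it only strips tatweel.
    return word.replace("\u0640", "")


# "x", a lone tatweel, "x": the middle token produces no stemmed term.
terms = list(index_terms(["x", "\u0640", "x"], toy_stem))
print(terms)
```

The caller never sees an empty term name, so nothing downstream has to special-case it.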
Looking at the Arabic algorithm, I notice it seems overly aggressive in
its removal of non-letters in general, unlike the other Snowball stemmers,
which generally leave non-words alone (your example of `.` stems to `.`
with all the other language stemmers I tried). While pure punctuation
strings are perhaps a bit esoteric, leaving non-words alone generally
seems a sensible approach. One problematic case with the current Arabic
stemming algorithm is that real numbers lose their decimal point, e.g.
`20.16` -> `2016`
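
To illustrate the "leave non-words alone" behaviour, here is a small Python sketch of a guard placed in front of a stemmer. It is not a change to the Snowball algorithm itself, and both function names are hypothetical. `toy_aggressive_stem` merely imitates the behaviour described above by dropping every character that is neither a letter nor a digit.

```python
def stem_words_only(token, stem):
    """Apply stem() only to tokens that contain at least one letter.

    Tokens without any letters (numbers, punctuation) pass through as-is.
    """
    if not any(ch.isalpha() for ch in token):
        return token
    return stem(token)


def toy_aggressive_stem(word):
    # Hypothetical stand-in that imitates the reported behaviour by
    # removing every character that is not a letter or a digit.
    return "".join(ch for ch in word if ch.isalnum())


# Without the guard the decimal point is lost.
print(toy_aggressive_stem("20.16"))
# With the guard, numbers and pure punctuation are left unchanged.
print(stem_words_only("20.16", toy_aggressive_stem))
print(stem_words_only(".", toy_aggressive_stem))
```

Note that a lone tatweel still counts as a letter here, so this guard does not replace the empty-stem check.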
--
Ticket URL: <https://trac.xapian.org/ticket/741#comment:7>
Xapian <https://xapian.org/>