[Xapian-tickets] [Xapian] #741: "Empty termnames aren't allowed" by indexing text in Arabic
Xapian
nobody at xapian.org
Thu Dec 15 11:09:47 GMT 2016
#741: "Empty termnames aren't allowed" by indexing text in Arabic
-------------------------+-----------------------------
Reporter: Kelson | Owner: olly
Type: defect | Status: assigned
Priority: normal | Milestone: 1.4.2
Component: Library API | Version: 1.4.1
Severity: normal | Resolution:
Keywords: | Blocked By:
Blocking: | Operating System: Linux
-------------------------+-----------------------------
Comment (by assem):
The normalization done within the Arabic stemmer covers the following:
- Strip vocalization marks
- Convert Eastern Arabic numerals (٠١٢٣٤٥٦٧٨٩) to Western Arabic numerals
(0123456789)
- Convert shaped letters to their independent Unicode form
- Separate LAM-ALEF into independent LAM and ALEF (some systems still
save them as a single symbol)
- Remove kashida (ARABIC TATWEEL)
{{{
'{_}' ( delete ) // strip kasheeda
}}}
- Remove punctuation marks
{{{
// Punctuation marks
'.' ',' ';' ':' '?' '!' '/' '*' '%' '\' '"' ( delete ) // General
'{,}' '{;}' '{?}' ( delete ) // Arabic-specific
}}}
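
The steps listed above can be sketched outside Snowball as follows. This is only an illustration in Python, not the stemmer's actual code. The code point ranges are assumptions: U+064B to U+0652 for vocalization marks, the Arabic Presentation Forms blocks for shaped letters, and U+060C, U+061B and U+061F for the Arabic-specific punctuation. The name `normalize_arabic` is hypothetical.

```python
import unicodedata

# Assumed code points: the comment names the steps, not the exact ranges.
VOCALIZATION = {chr(cp) for cp in range(0x064B, 0x0653)}  # fathatan .. sukun
TATWEEL = "\u0640"                                         # kashida
PUNCTUATION = set('.,;:?!/*%\\"') | {"\u060C", "\u061B", "\u061F"}
EASTERN_DIGITS = {0x0660 + i: str(i) for i in range(10)}   # U+0660..U+0669 -> 0..9


def is_presentation_form(ch):
    """True for Arabic Presentation Forms-A/B (shaped letters, ligatures)."""
    cp = ord(ch)
    return 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFC


def normalize_arabic(text):
    """Hypothetical helper applying the normalization steps in order."""
    # Shaped letters -> independent form; this also splits LAM-ALEF
    # ligatures into LAM + ALEF via their compatibility decomposition.
    text = "".join(
        unicodedata.normalize("NFKC", ch) if is_presentation_form(ch) else ch
        for ch in text
    )
    # Eastern Arabic numerals -> Western Arabic numerals.
    text = text.translate(EASTERN_DIGITS)
    # Delete vocalization marks, tatweel and punctuation.
    drop = VOCALIZATION | {TATWEEL} | PUNCTUATION
    return "".join(ch for ch in text if ch not in drop)
```

Note that with this sketch an input made up only of deleted characters (for example a lone tatweel or `،`) normalizes to an empty string, which may be related to the "Empty termnames aren't allowed" error in the ticket title.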
-----
For punctuation marks, please suggest which ones you think are better kept.
Just a note: in Arabic, `,` is generally used as the decimal mark rather
than as punctuation (the comma is `،`). Some writers, however, use the
English decimal mark `.` instead.
--
Ticket URL: <https://trac.xapian.org/ticket/741#comment:8>
Xapian <https://xapian.org/>