[Xapian-tickets] [Xapian] #741: "Empty termnames aren't allowed" by indexing text in Arabic

Xapian nobody at xapian.org
Thu Dec 15 11:09:47 GMT 2016


#741: "Empty termnames aren't allowed" by indexing text in Arabic
-------------------------+-----------------------------
 Reporter:  Kelson       |             Owner:  olly
     Type:  defect       |            Status:  assigned
 Priority:  normal       |         Milestone:  1.4.2
Component:  Library API  |           Version:  1.4.1
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  Linux
-------------------------+-----------------------------

Comment (by assem):

 The normalization steps done within the Arabic stemmer are:

 - Strip vocalization marks
 - Convert Eastern Arabic numerals (٠١٢٣٤٥٦٧٨٩) to Western Arabic numerals
 (0123456789)
 - Convert shaped letters to their independent Unicode form
 - Separate LAM-ALEF into independent LAM and ALEF (some systems still
 store them as a single symbol)
 - Remove kashida (ARABIC TATWEEL)
 {{{
  '{_}' ( delete ) // strip kasheeda
 }}}

 - Remove punctuation marks

 {{{
                 // Punctuation marks
                 '.' ',' ';' ':'  '?' '!' '/' '*' '%' '\' '"' ( delete ) // General
                 '{,}' '{;}' '{?}'  ( delete ) // Arabic-specific
 }}}
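
 For readers less familiar with Snowball, the steps listed above can be
 roughly sketched in Python. This is only an illustration and not the
 stemmer's actual code: the function name `normalize_arabic` is
 hypothetical, and using NFKC to unshape letters and split LAM-ALEF is an
 assumption made for the sketch.

```python
import unicodedata

# Translation table: code point -> replacement (None means delete).
_TABLE = {}

# Vocalization marks, FATHATAN (U+064B) through SUKUN (U+0652).
_TABLE.update({cp: None for cp in range(0x064B, 0x0653)})

# Kashida, i.e. ARABIC TATWEEL.
_TABLE[0x0640] = None

# Eastern Arabic numerals U+0660..U+0669 -> "0".."9".
_TABLE.update({0x0660 + i: str(i) for i in range(10)})

# General punctuation plus ARABIC COMMA, SEMICOLON and QUESTION MARK.
for ch in '.,;:?!/*%\\"' + "\u060C\u061B\u061F":
    _TABLE[ord(ch)] = None


def normalize_arabic(text):
    """Hypothetical helper mirroring the normalization steps in this comment."""
    # Assumption: NFKC maps shaped (presentation-form) letters to their
    # independent forms and splits the LAM-ALEF ligature into LAM + ALEF.
    text = unicodedata.normalize("NFKC", text)
    return text.translate(_TABLE)
```

 With this sketch, a token made up only of punctuation marks normalizes to
 an empty string, so a caller would have to skip empty results before
 adding terms.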

 -----
 For punctuation marks, please suggest which ones you think are better
 kept. Just a note: in Arabic we generally use `,` as the decimal mark,
 and it is not used as punctuation (`،` is the comma). However, some
 people just use the English decimal mark `.`.

--
Ticket URL: <https://trac.xapian.org/ticket/741#comment:8>
Xapian <https://xapian.org/>