[Xapian-tickets] [Xapian] #741: "Empty termnames aren't allowed" by indexing text in Arabic

Xapian nobody at xapian.org
Wed Dec 14 17:43:26 GMT 2016


#741: "Empty termnames aren't allowed" by indexing text in Arabic
-------------------------+-----------------------------
 Reporter:  Kelson       |             Owner:  olly
     Type:  defect       |            Status:  assigned
 Priority:  normal       |         Milestone:  1.4.2
Component:  Library API  |           Version:  1.4.1
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  Linux
-------------------------+-----------------------------

Comment (by assem):

 @Olly

 The Arabic stemmer performs normalization before stemming, and that
 normalization removes the `'ARABIC TATWEEL' (U+0640)` character, which is
 used to elongate words without changing their shape. That is why a term
 consisting only of a Tatweel ends up empty. In the wrong.txt case, the
 Tatweel strangely appears on its own (apparently mistaken for a dash or
 underscore). I think a lone TATWEEL should be treated like a lone dot ".":
 it should never be tokenized as an independent term.
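
 A minimal sketch of the effect described above, assuming normalization
 simply deletes U+0640. The `strip_tatweel` helper is hypothetical and only
 stands in for the stemmer's normalization step; it is not Xapian or
 Snowball code.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL


def strip_tatweel(term: str) -> str:
    """Hypothetical stand-in for the normalization step: delete every Tatweel."""
    return term.replace(TATWEEL, "")


# Tatweel inside a word is dropped and the letters remain.
# seen, lam, tatweel, tatweel, alef, meem
elongated = "\u0633\u0644\u0640\u0640\u0627\u0645"
normalized_word = strip_tatweel(elongated)

# A token made of a lone Tatweel normalizes to the empty string,
# which cannot be used as a term name.
lone = strip_tatweel(TATWEEL)
```

 Under this assumption, the elongated word keeps its four letters, while
 the lone Tatweel leaves nothing behind to index.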

 What do you suggest as a solution for this: doing normalization before
 tokenization, or changing tokenization so that it does not generate the
 term in the first place?
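
 The two options can be compared in a rough sketch. Everything here is an
 assumption for illustration: `tokenize` is a plain whitespace splitter
 rather than Xapian's tokenizer, and `normalize` only deletes U+0640.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL


def tokenize(text):
    """Simplistic whitespace tokenizer (assumption, not Xapian's)."""
    return text.split()


def normalize(text):
    """Hypothetical normalization: delete every Tatweel."""
    return text.replace(TATWEEL, "")


def normalize_then_tokenize(text):
    """Option 1: normalize the text first, then tokenize it."""
    return tokenize(normalize(text))


def tokenize_skipping_lone_tatweel(text):
    """Option 2: tokenize first, but never emit a Tatweel-only token."""
    return [tok for tok in tokenize(text) if tok.strip(TATWEEL)]


# An elongated word, a lone Tatweel, and an ordinary word.
sample = "\u0633\u0644\u0640\u0627\u0645 \u0640 \u0642\u0644\u0645"
terms_option_1 = normalize_then_tokenize(sample)
terms_option_2 = tokenize_skipping_lone_tatweel(sample)
```

 In this sketch both options drop the lone Tatweel. They differ in that
 option 1 also removes the Tatweel inside the elongated word before any
 term is produced, whereas option 2 leaves that to the later
 normalization step.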

--
Ticket URL: <https://trac.xapian.org/ticket/741#comment:6>
Xapian <https://xapian.org/>