[Xapian-tickets] [Xapian] #741: "Empty termnames aren't allowed" by indexing text in Arabic
Xapian
nobody at xapian.org
Wed Dec 14 17:43:26 GMT 2016
#741: "Empty termnames aren't allowed" by indexing text in Arabic
-------------------------+-----------------------------
Reporter: Kelson | Owner: olly
Type: defect | Status: assigned
Priority: normal | Milestone: 1.4.2
Component: Library API | Version: 1.4.1
Severity: normal | Resolution:
Keywords: | Blocked By:
Blocking: | Operating System: Linux
-------------------------+-----------------------------
Comment (by assem):
@Olly
The Arabic stemmer normalizes the text before stemming, which is why the
`'ARABIC TATWEEL' (U+0640) character`, which is used to make words longer
without changing their shape, is removed; a term consisting only of a
Tatweel would then be left empty. In the wrong.txt case, the Tatweel
strangely appears on its own (perhaps confused with a dash or underscore).
I think a lone TATWEEL should be treated like a lone dot ".": it should
never be tokenized as an independent term.
What do you suggest as a solution for this: doing normalization before
tokenization, or changing the tokenization so it does not generate the term
in the first place?
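
To make the two options concrete, here is a rough standalone sketch in Python. It does not use Xapian's API: `tokenize`, `strip_tatweel` and both `terms_*` functions are hypothetical stand-ins, and the regex tokenizer is only an assumption about how a lone Tatweel could end up as a term.

```python
import re

TATWEEL = "\u0640"  # ARABIC TATWEEL


def strip_tatweel(text):
    """Stand-in for the stemmer's normalization step: drop every Tatweel."""
    return text.replace(TATWEEL, "")


def tokenize(text):
    """Hypothetical tokenizer: runs of word characters.

    Python's \\w matches U+0640, so a lone Tatweel becomes a token here.
    """
    return re.findall(r"\w+", text)


def terms_normalize_first(text):
    """Option 1: normalize before tokenizing, so no Tatweel-only token exists."""
    return tokenize(strip_tatweel(text))


def terms_filter_empty(text):
    """Option 2: tokenize first, then skip tokens that normalize to nothing."""
    terms = []
    for token in tokenize(text):
        normalized = strip_tatweel(token)
        if normalized:  # a Tatweel-only token normalizes to "" and is dropped
            terms.append(normalized)
    return terms


# A stretched word, a lone Tatweel, and a plain word.
sample = "\u0643\u062a\u0640\u0640\u0627\u0628 \u0640 \u0642\u0644\u0645"
print(tokenize(sample))               # includes the lone Tatweel as a token
print(terms_normalize_first(sample))  # lone Tatweel never becomes a term
print(terms_filter_empty(sample))     # same terms, filtered after the fact
```

In this sketch both options yield the same terms; the difference is only where the Tatweel-only token gets discarded.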
--
Ticket URL: <https://trac.xapian.org/ticket/741#comment:6>
Xapian <https://xapian.org/>