[Xapian-tickets] [Xapian] #741: "Empty termnames aren't allowed" by indexing text in Arabic
Xapian
nobody at xapian.org
Fri Dec 16 06:57:08 GMT 2016
#741: "Empty termnames aren't allowed" by indexing text in Arabic
-------------------------+-----------------------------
Reporter: Kelson | Owner: olly
Type: defect | Status: assigned
Priority: normal | Milestone: 1.4.2
Component: Library API | Version: 1.4.1
Severity: normal | Resolution:
Keywords: | Blocked By:
Blocking: | Operating System: Linux
-------------------------+-----------------------------
Comment (by olly):
> To handle this outside the stemmer I think it's best to just quietly
> ignore empty stems.
Implemented in b717dc0f3b9074cec38bc3f1cb0dff778bd44b73 on git master.
Needs backporting for 1.4.2, and is perhaps worth considering for the
next 1.2.x release (1.2.x doesn't have the Arabic stemmer, but does
support user stemmers).
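
As an illustration only, the idea of quietly ignoring empty stems outside
the stemmer could be sketched as below. This is Python rather than
Xapian's C++, it is not the code from the commit, and `stem_tokens` and
`toy_stem` are hypothetical names that are not part of the Xapian API.

```python
from typing import Callable, Iterable, List


def stem_tokens(tokens: Iterable[str], stem: Callable[[str], str]) -> List[str]:
    """Stem each token, silently skipping any whose stem comes back empty."""
    terms = []
    for token in tokens:
        stemmed = stem(token)
        if not stemmed:
            # An empty stem would become an empty term name, which the
            # index rejects, so drop it quietly instead of raising.
            continue
        terms.append(stemmed)
    return terms


def toy_stem(word: str) -> str:
    """Hypothetical stemmer (assumption): strips punctuation from both ends,
    so a token that is only punctuation stems to an empty string."""
    return word.strip(".,!?\u060c\u061f")  # includes Arabic comma and question mark
```

With this sketch, a token consisting only of an Arabic comma produces no
term at all, while ordinary words pass through unchanged.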
> For punctuation marks, suggest what you think it's better to keep them
I think it's probably better for the stemmers to leave punctuation alone
as a general rule - the tokeniser should already have handled removing it
where it isn't wanted. There may be a few language-specific exceptions
for special cases (the English stemmer has some special handling for a
`'s` suffix for example - e.g. "king's" and "king" really should be
conflated).
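
A rough sketch of the kind of `'s` special case mentioned above, in
Python. It is an assumption-laden toy and not the English stemmer's
actual rules; `strip_possessive` is a hypothetical helper name.

```python
def strip_possessive(word: str) -> str:
    """Remove a trailing 's so that possessive and plain forms conflate.

    Toy illustration only: all other punctuation is left untouched, on
    the assumption that the tokeniser has already dealt with it.
    """
    suffix = "'s"
    # Require something before the suffix so the result is never empty.
    if word.endswith(suffix) and len(word) > len(suffix):
        return word[: -len(suffix)]
    return word
```

Here `strip_possessive("king's")` and `strip_possessive("king")` both
give `king`, while a bare `'s` is returned unchanged rather than being
reduced to an empty string.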
--
Ticket URL: <https://trac.xapian.org/ticket/741#comment:9>
Xapian <https://xapian.org/>