[Xapian-tickets] [Xapian] #741: "Empty termnames aren't allowed" by indexing text in Arabic

Xapian nobody at xapian.org
Fri Dec 16 06:57:08 GMT 2016


#741: "Empty termnames aren't allowed" by indexing text in Arabic
-------------------------+-----------------------------
 Reporter:  Kelson       |             Owner:  olly
     Type:  defect       |            Status:  assigned
 Priority:  normal       |         Milestone:  1.4.2
Component:  Library API  |           Version:  1.4.1
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  Linux
-------------------------+-----------------------------

Comment (by olly):

 > To handle this outside the stemmer I think it's best to just quietly
 > ignore empty stems.

 Implemented in b717dc0f3b9074cec38bc3f1cb0dff778bd44b73 on git master.
 Needs backporting for 1.4.2, and maybe considering for the next 1.2.x
 release (1.2.x doesn't have the Arabic stemmer, but does support user
 stemmers).
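
 As a rough illustration of "quietly ignore empty stems", here is a minimal
 sketch in Python.  It is not the code from the commit above: `stem_terms`
 and `toy_stem` are hypothetical names invented for this example, and the
 "Z" prefix is only assumed here to mirror how stemmed terms are usually
 marked.  The point is just that a stemmer which reduces a word to nothing
 should result in no term being added, rather than an error.

```python
from typing import Callable, Iterable, List


def stem_terms(tokens: Iterable[str],
               stem: Callable[[str], str],
               prefix: str = "Z") -> List[str]:
    """Return prefixed stemmed terms, silently dropping empty stems."""
    terms = []
    for token in tokens:
        stemmed = stem(token)
        if not stemmed:
            # The stemmer reduced the token to nothing.  An empty term
            # name is not allowed, so skip it quietly instead of failing.
            continue
        terms.append(prefix + stemmed)
    return terms


def toy_stem(word: str) -> str:
    """Hypothetical stemmer: strips one fixed suffix, possibly to nothing."""
    suffix = "ing"
    return word[:-len(suffix)] if word.endswith(suffix) else word


if __name__ == "__main__":
    # "ing" stems to the empty string and is dropped without an error.
    print(stem_terms(["indexing", "ing", "text"], toy_stem))
```

 The same guard works for user-supplied stemmers, since it only looks at
 the stemmer's output and not at which language it implements.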

 > For punctuation marks, suggest what you think it's better to keep them

 I think it's probably better for the stemmers to leave punctuation alone
 as a general rule - the tokeniser should already have handled removing it
 where it isn't wanted.  There may be a few language-specific exceptions
 for special cases (for example, the English stemmer has some special
 handling for a `'s` suffix, since "king's" and "king" really should be
 conflated).
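
 To make the `'s` case concrete, here is a small Python sketch of that one
 language-specific exception.  It is an assumption-laden illustration, not
 the English stemmer's actual algorithm: `strip_possessive` is a
 hypothetical helper, and it deliberately leaves all other punctuation
 untouched, on the basis that the tokeniser has already dealt with it.

```python
def strip_possessive(word: str) -> str:
    """Drop a trailing 's (ASCII or typographic apostrophe), nothing else."""
    for suffix in ("'s", "\u2019s"):
        # Require something before the suffix so the result is never empty.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word


if __name__ == "__main__":
    for w in ("king's", "king", "o'clock"):
        print(w, "->", strip_possessive(w))
```

 Note that the length check means a bare `'s` is returned unchanged, so
 this special case cannot itself produce an empty stem.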

--
Ticket URL: <https://trac.xapian.org/ticket/741#comment:9>
Xapian <https://xapian.org/>