[Xapian-tickets] [Xapian] #750: TermGenerator do not stop stemmed term.

Fri Jul 28 05:24:52 BST 2017

#750: TermGenerator do not stop stemmed term.
-------------------------+---------------------------------
 Reporter:  mgautier     |             Owner:  samuelharden
     Type:  defect       |            Status:  new
 Priority:  normal       |         Milestone:
Component:  QueryParser  |           Version:  git master
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  All
-------------------------+---------------------------------

Comment (by olly):

 The patch which exposed this functionality unfortunately mis-documented
 what two of the three options actually do (`STOP_NONE` is OK, the others
 aren't).

 The stopper is expected to always be fed the unstemmed form (it takes a
 '''word''' not a '''stem''').  Passing stemmed forms to a stopper which is
 checking a list of words seems a bad idea.  The stemmer maps words to
 stems, and the two are really separate spaces (in some cases, the stem
 happens to be the same string as one of the words which stems to it, but
 that doesn't mean the stemmer is mapping words to words).  So for example,
 using the English stemmer, the word "tease" has the stem "teas".  But
 that's nothing to do with the word "teas" (which has the stem "tea").
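
 As a rough illustration of why the two spaces shouldn't be mixed, here's a
 toy Python sketch.  `toy_stem`, `WORD_LIST` and the two check functions are
 hypothetical names made up for this example, and the lookup table only
 stands in for the English stemmer using the two mappings quoted above; none
 of this is Xapian API.

```python
# Toy stand-in for the English stemmer: a lookup table holding only the two
# mappings mentioned in the comment.  A real stemmer is an algorithm.
TOY_STEMS = {"tease": "teas", "teas": "tea"}

# Hypothetical word list that a stopper might check against.
WORD_LIST = {"teas"}


def toy_stem(word):
    """Return the toy stem for a word, or the word itself if unknown."""
    return TOY_STEMS.get(word, word)


def check_word(word):
    """What the stopper expects: look up the unstemmed word."""
    return word in WORD_LIST


def check_stem(word):
    """The problematic variant: look up the stem in a list of words."""
    return toy_stem(word) in WORD_LIST


# "tease" is not in the list, but its stem collides with the word "teas".
tease_by_word = check_word("tease")
tease_by_stem = check_stem("tease")

# "teas" is in the list, but its stem "tea" is not.
teas_by_word = check_word("teas")
teas_by_stem = check_stem("teas")
```

 Feeding the stem to the word list gets both cases wrong here: it matches a
 word that isn't listed and misses one that is.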

 `STOP_STEMMED` is actually "check the unstemmed form with the stopper, and
 if it's a stop word, only index its unstemmed form" - this is a useful
 thing to do because it means searches for phrases which include stopwords
 work (the unstemmed forms are indexed with positional information).

 `STOP_ALL` is actually "check the unstemmed form with the stopper, and if
 it's a stop word, skip the word".  At least in English there are cases
 where a word has multiple meanings, and only one is really a stopword.
 For example, "can" would probably be on an English stopword list, because
 it's a form of the irregular verb meaning "to be able to".  But it's also
 a noun (a metal container) and a different regular verb (meaning to put
 something in such a metal container), etc, and those words shouldn't
 really be stopwords.  So while "cans" and "canned" also stem the same way
 as "can", it's unhelpful to treat them as stopwords too.

 If you use the same stopper when parsing queries, this should work nicely
 - "can" will also be treated as a stop word in queries, but a search for
 "canned" will still match "canned" or "cans" in documents.
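
 The behaviour described above can be modelled with a small Python sketch.
 This is a simplified toy, not the TermGenerator implementation:
 `index_terms`, `query_term`, `toy_stem` and the word lists are all
 hypothetical, the "Z" prefix for stemmed terms is assumed here just to keep
 the two kinds of term apart, and positional information is left out.

```python
# Toy stand-in for the English stemmer (assumption: lookup table only).
TOY_STEMS = {"cans": "can", "canned": "can"}

# Hypothetical stopword list.
STOPWORDS = {"can", "the"}


def toy_stem(word):
    return TOY_STEMS.get(word, word)


def is_stopword(word):
    # The stopper is always given the unstemmed word.
    return word in STOPWORDS


def index_terms(words, strategy):
    """Return the set of terms a toy indexer would generate.

    Unstemmed terms are stored as-is; stemmed terms get a "Z" prefix.
    """
    terms = set()
    for word in words:
        stopped = is_stopword(word)
        if stopped and strategy == "STOP_ALL":
            continue  # skip the word entirely
        terms.add(word)  # unstemmed form
        if stopped and strategy == "STOP_STEMMED":
            continue  # stop word: only its unstemmed form is indexed
        terms.add("Z" + toy_stem(word))  # stemmed form
    return terms


def query_term(word):
    """Toy query side using the same stopper: stop words produce no term."""
    if is_stopword(word):
        return None
    return "Z" + toy_stem(word)


doc = ["the", "cans", "can"]
none_terms = index_terms(doc, "STOP_NONE")
stemmed_terms = index_terms(doc, "STOP_STEMMED")
all_terms = index_terms(doc, "STOP_ALL")

# A query for "canned" still matches the document via "cans".
canned_matches = query_term("canned") in stemmed_terms

# A document containing only the stop word "can" has no stemmed term to match.
only_can = index_terms(["the", "can"], "STOP_STEMMED")
canned_matches_only_can = query_term("canned") in only_can
```

 In this model "cans" and "canned" are never stopped, because the stopper
 only ever sees the unstemmed words and neither is on the list.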

 English is particularly rife with words with lots of different meanings,
 and I'm not sure how common this situation is in other languages, but as
 best I can make out your example "lea" is actually a name
 (https://en.wiktionary.org/wiki/L%C3%A9a) which happens to stem to the
 same thing as the article "le", in which case I'd argue that "lea" really
 shouldn't be treated as a stopword either.

 Given `STOP_STEMMED` is the default, and before this patch it was long-
 established as the hard-coded behaviour when a stopper was set, changing
 what it means now to try to match what the current API documentation says
 would be unhelpful, and I think fixing the documentation makes most sense.

 You can actually already stop any word which stems the same way as a
 stopword by providing a stopper which stems its input before checking it
 against a list of stems of stopwords, but we could perhaps provide a mode
 (or a special `Stopper` subclass) to streamline this, if it's actually a
 sensible thing to be doing.
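
 A minimal sketch of such a stopper, written as plain Python rather than
 against the real `Stopper` class: `StemmingStopper` and `toy_stem` are
 hypothetical names, and the lookup table again only stands in for a real
 stemmer.

```python
# Toy stand-in for the English stemmer (assumption: lookup table only).
TOY_STEMS = {"cans": "can", "canned": "can"}


def toy_stem(word):
    return TOY_STEMS.get(word, word)


class StemmingStopper:
    """Hypothetical stopper which stems its input before checking it.

    The stopword list is stemmed once up front, so each lookup compares
    a stem against a set of stems and never a stem against words.
    """

    def __init__(self, stopwords, stem):
        self._stem = stem
        self._stop_stems = {stem(word) for word in stopwords}

    def __call__(self, word):
        return self._stem(word) in self._stop_stems


stopper = StemmingStopper({"can", "the"}, toy_stem)

can_stopped = stopper("can")
canned_stopped = stopper("canned")
cat_stopped = stopper("cat")
```

 With this stopper "canned" and "cans" are stopped along with "can", which
 is exactly the effect argued above to be unhelpful for English.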

 Anyway, I'm afraid the patch in that PR isn't an appropriate change.

--
Ticket URL: <https://trac.xapian.org/ticket/750#comment:5>
Xapian <https://xapian.org/>