[Xapian-tickets] [Xapian] #750: TermGenerator do not stop stemmed term.
Xapian
nobody at xapian.org
Fri Jul 28 05:24:52 BST 2017
#750: TermGenerator do not stop stemmed term.
-------------------------+---------------------------------
Reporter: mgautier | Owner: samuelharden
Type: defect | Status: new
Priority: normal | Milestone:
Component: QueryParser | Version: git master
Severity: normal | Resolution:
Keywords: | Blocked By:
Blocking: | Operating System: All
-------------------------+---------------------------------
Comment (by olly):
The patch which exposed this functionality unfortunately mis-documented
what two of the three options actually do (`STOP_NONE` is OK, the others
aren't).
The stopper is expected to always be fed the unstemmed form (it takes a
'''word''' not a '''stem'''). Passing stemmed forms to a stopper which is
checking a list of words seems a bad idea. The stemmer maps words to
stems, and the two are really separate spaces (in some cases, the stem
happens to be the same string as one of the words which stems to it, but
that doesn't mean the stemmer is mapping words to words). So for example,
using the English stemmer, the word "tease" has the stem "teas". But
that's nothing to do with the word "teas" (which has the stem "tea").
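
The "tease"/"teas" collision can be illustrated with a minimal Python sketch. This is not Xapian's API: `TOY_STEMS` is a hand-written stand-in for the English stemmer that holds only the two mappings mentioned above, and the stopword list is hypothetical, chosen purely to show the collision.

```python
# Toy stand-in for the English stemmer: only the two mappings discussed above.
TOY_STEMS = {"tease": "teas", "teas": "tea"}


def toy_stem(word):
    """Map a word to its stem; words not in the toy table are returned unchanged."""
    return TOY_STEMS.get(word, word)


# A stopper checks a list of *words*. Suppose (hypothetically) "teas" were on it.
STOP_WORDS = {"teas"}


def is_stop_word(word):
    return word in STOP_WORDS


# Feeding the stopper the unstemmed word: "tease" is not stopped.
tease_stopped_by_word = is_stop_word("tease")

# Feeding the stopper the stem: "tease" is stopped only because its stem
# happens to be spelled like an unrelated word on the list.
tease_stopped_by_stem = is_stop_word(toy_stem("tease"))
```

Here `tease_stopped_by_word` is `False` and `tease_stopped_by_stem` is `True`, which is the kind of accidental match that checking stems against a word list invites.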
`STOP_STEMMED` is actually "check the unstemmed form with the stopper, and
if it's a stop word, only index its unstemmed form" - this is a useful
thing to do because it means searches for phrases which include stopwords
work (the unstemmed forms are indexed with positional information).
`STOP_ALL` is actually "check the unstemmed form with the stopper, and if
it's a stop word, skip the word". At least in English there are cases
where a word has multiple meanings, and only one is really a stopword.
For example, "can" would probably be on an English stopword list, because
it's a form of the irregular verb meaning "to be able to". But it's also
a noun (a metal container) and a different regular verb (meaning to put
something in such a metal container), etc, and those words shouldn't
really be stopwords. So while "cans" and "canned" also stem the same way
as "can", it's unhelpful to treat them as stopwords too.
If you use the same stopper when parsing queries, this should work nicely
- "can" will also be treated as a stop word in queries, but a search for
"canned" will still match "canned" or "cans" in documents.
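
The three modes as described above can be modelled with a small toy indexer. This is only a sketch of the described behaviour, not Xapian's implementation: `index_words()`, `TOY_STEMS` and `STOP_WORDS` are hypothetical names, the stem table covers just the "can" example, and positions are simply each word's index in the input.

```python
# Toy stem table and stopword list, covering only the "can" example.
TOY_STEMS = {"can": "can", "cans": "can", "canned": "can"}
STOP_WORDS = {"can", "the"}

# Stand-ins for the three strategy names discussed in this comment.
STOP_NONE, STOP_ALL, STOP_STEMMED = "none", "all", "stemmed"


def index_words(words, strategy=STOP_STEMMED):
    """Return (list of (position, unstemmed term), set of stemmed terms)."""
    positional = []  # unstemmed forms, with positional information
    stemmed = set()  # stemmed forms, marked with a "Z" prefix
    for pos, word in enumerate(words, start=1):
        # The stopper is always asked about the unstemmed word.
        is_stop = word in STOP_WORDS
        if is_stop and strategy == STOP_ALL:
            continue  # skip the word entirely
        positional.append((pos, word))
        if is_stop and strategy == STOP_STEMMED:
            continue  # index only the unstemmed form
        stemmed.add("Z" + TOY_STEMS.get(word, word))
    return positional, stemmed


can_default = index_words(["can"])               # unstemmed form only
can_skipped = index_words(["can"], STOP_ALL)     # nothing indexed
canned_default = index_words(["canned"])         # unstemmed and stemmed forms
```

In this model "can" never contributes a stemmed term when a stopper is active, while "canned" and "cans" still do, so a stemmed search for "canned" keeps matching documents containing either form.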
English is particularly rife with words with lots of different meanings,
and I'm not sure how common this situation is in other languages, but as
best I can make out your example "lea" is actually a name
(https://en.wiktionary.org/wiki/L%C3%A9a) which happens to stem to the
same thing as the article "le", in which case I'd argue that "lea" really
shouldn't be treated as a stopword either.
Given `STOP_STEMMED` is the default, and before this patch it was long-
established as the hard-coded behaviour when a stopper was set, changing
what it means now to try to match what the current API documentation says
would be unhelpful, and I think fixing the documentation makes most sense.
You can actually already stop any word which stems the same way as a
stopword by providing a stopper which stems its input before checking it
against a list of stems of stopwords, but we could perhaps provide a mode
(or a special `Stopper` subclass) to streamline this, if it's actually a
sensible thing to be doing.
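
As a rough illustration of such a stemming stopper, here is a minimal Python sketch. It only mimics the idea with a callable object; `StemmingStopper`, `toy_stem` and `TOY_STEMS` are hypothetical names, and the stem table is a hand-written stand-in for a real stemmer. With Xapian itself this would presumably be done in a `Stopper` subclass.

```python
# Toy stand-in for a stemmer, covering only the "can" example.
TOY_STEMS = {"can": "can", "cans": "can", "canned": "can"}


def toy_stem(word):
    """Map a word to its stem; words not in the toy table are returned unchanged."""
    return TOY_STEMS.get(word, word)


class StemmingStopper:
    """Stops any word whose stem matches the stem of a listed stopword."""

    def __init__(self, stop_words, stem):
        self._stem = stem
        # Keep the stems of the stopwords, not the stopwords themselves.
        self._stop_stems = {stem(word) for word in stop_words}

    def __call__(self, word):
        # Stem the incoming (unstemmed) word, then compare stem against stem.
        return self._stem(word) in self._stop_stems


stopper = StemmingStopper(["can"], toy_stem)
```

With this stopper, "can", "cans" and "canned" are all stopped, which is exactly the over-broad behaviour questioned above, so it is worth checking whether that is really wanted before using it.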
Anyway, I'm afraid the patch in that PR isn't an appropriate change.
--
Ticket URL: <https://trac.xapian.org/ticket/750#comment:5>
Xapian <https://xapian.org/>