[Xapian-tickets] [Xapian] #750: TermGenerator do not stop stemmed term.

Xapian nobody at xapian.org
Wed Aug 9 05:39:23 BST 2017


#750: TermGenerator do not stop stemmed term.
-------------------------+---------------------------------
 Reporter:  mgautier     |             Owner:  samuelharden
     Type:  defect       |            Status:  new
 Priority:  normal       |         Milestone:  1.4.5
Component:  QueryParser  |           Version:  git master
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  All
-------------------------+---------------------------------

Comment (by olly):

 Assuming you set the same stopper at query time, this should just work
 with those settings - testing with a slightly patched version of
 `examples/quest.cc` I get:

 {{{
 $ git diff
 diff --git a/xapian-core/examples/quest.cc b/xapian-core/examples/quest.cc
 index 9c199c340d5f..d79a7bcf414d 100644
 --- a/xapian-core/examples/quest.cc
 +++ b/xapian-core/examples/quest.cc
 @@ -37,15 +37,7 @@ using namespace std;

  // Stopwords:
  static const char * sw[] = {
 -    "a", "about", "an", "and", "are", "as", "at",
 -    "be", "by",
 -    "en",
 -    "for", "from",
 -    "how",
 -    "i", "in", "is", "it",
 -    "of", "on", "or",
 -    "that", "the", "this", "to",
 -    "was", "what", "when", "where", "which", "who", "why", "will", "with"
 +    "la", "le"
  };

  struct qp_flag { const char * s; unsigned f; };
 @@ -362,7 +354,7 @@ try {

      parser.set_database(db);
      parser.set_stemmer(stemmer);
 -    parser.set_stemming_strategy(Xapian::QueryParser::STEM_SOME);
 +    parser.set_stemming_strategy(Xapian::QueryParser::STEM_ALL);
      parser.set_stopper(&mystopper);

      Xapian::Query query = parser.parse_query(argv[optind], flags);
 $ examples/quest --stemmer fr 'le camion'
 Parsed Query: Query(camion@2)
 No database specified so not running the query.
 $ examples/quest --stemmer fr 'lea seydoux'
 Parsed Query: Query((le@1 OR seydoux@2))
 No database specified so not running the query.
 }}}

 So `le` is handled as a stopword, and `lea` is stemmed to `le` and
 included in the search.
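
 The order of operations behind that result can be modelled with a small
 standalone sketch.  This is not Xapian code: `toy_stem`, `toy_parse` and
 `Term` are hypothetical names, and the "stemmer" is a lookup table holding
 only the stems visible in the output in this comment.  It assumes, in line
 with the output above, that the stopper is consulted on the word as typed
 and that stemming only happens afterwards.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Stopword list from the patched quest.cc above.
static const std::set<std::string> stopwords = {"la", "le"};

// A parsed term and its position within the query (hypothetical type).
struct Term {
    std::string term;
    unsigned pos;
};

// Hypothetical stand-in for the French stemmer: it only knows the
// stems reported in this comment and leaves every other word alone.
static std::string toy_stem(const std::string& word) {
    static const std::map<std::string, std::string> stems = {
        {"lea", "le"}, {"voiture", "voitur"}
    };
    auto it = stems.find(word);
    return it == stems.end() ? word : it->second;
}

// Split on whitespace, drop words which are stopwords in their
// unstemmed form, then stem whatever is left.  Positions count every
// word, including stopped ones, so "camion" keeps position 2.
static std::vector<Term> toy_parse(const std::string& query) {
    std::vector<Term> terms;
    std::istringstream in(query);
    std::string word;
    unsigned pos = 0;
    while (in >> word) {
        ++pos;
        if (stopwords.count(word)) continue;  // checked before stemming
        terms.push_back(Term{toy_stem(word), pos});
    }
    return terms;
}
```

 With this model, `toy_parse("le camion")` gives just `camion` at position
 2, while `toy_parse("lea seydoux")` gives `le` at position 1 and `seydoux`
 at position 2: `lea` is not in the stopword list, so it survives the check
 and only then becomes `le`.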

 There's one slight wrinkle: where search-time stopwording is suppressed
 (e.g. because `le` is used in a phrase, or an individual `le` is quoted,
 or the query is composed entirely of stopwords), `le` in the query won't
 be stopped and will match `lea` in the document, e.g.:

 {{{
 $ examples/quest --stemmer fr '"le voiture"'
 Parsed Query: Query((le@1 PHRASE 2 voitur@2))
 No database specified so not running the query.
 $ examples/quest --stemmer fr '"le" voiture'
 Parsed Query: Query((le@1 OR voitur@2))
 No database specified so not running the query.
 $ examples/quest --stemmer fr 'le la'
 Parsed Query: Query((le@1 OR la@2))
 No database specified so not running the query.
 }}}

 I think to handle such cases, we'd probably need to explicitly teach
 `QueryParser` about the different stop strategies.  If it knew we were
 removing stopwords at index time, then in cases like the above it could do
 something more appropriate.


 You could switch to `STEM_SOME` - that's the default mode of operation,
 and allows for exact matching of words and exact phrase searches, which
 you can't achieve if you only index the stemmed forms.  The downside is
 that the database will be larger.

--
Ticket URL: <https://trac.xapian.org/ticket/750#comment:8>
Xapian <https://xapian.org/>