[Xapian-tickets] [Xapian] #750: TermGenerator do not stop stemmed term.
Xapian
nobody at xapian.org
Wed Aug 9 05:39:23 BST 2017
#750: TermGenerator do not stop stemmed term.
-------------------------+---------------------------------
Reporter: mgautier | Owner: samuelharden
Type: defect | Status: new
Priority: normal | Milestone: 1.4.5
Component: QueryParser | Version: git master
Severity: normal | Resolution:
Keywords: | Blocked By:
Blocking: | Operating System: All
-------------------------+---------------------------------
Comment (by olly):
Assuming you set the same stopper at query time, this should just work
with those settings. Testing with a slightly patched version of
`examples/quest.cc`, I get:
{{{
$ git diff
diff --git a/xapian-core/examples/quest.cc b/xapian-core/examples/quest.cc
index 9c199c340d5f..d79a7bcf414d 100644
--- a/xapian-core/examples/quest.cc
+++ b/xapian-core/examples/quest.cc
@@ -37,15 +37,7 @@ using namespace std;
// Stopwords:
static const char * sw[] = {
- "a", "about", "an", "and", "are", "as", "at",
- "be", "by",
- "en",
- "for", "from",
- "how",
- "i", "in", "is", "it",
- "of", "on", "or",
- "that", "the", "this", "to",
- "was", "what", "when", "where", "which", "who", "why", "will", "with"
+ "la", "le"
};
struct qp_flag { const char * s; unsigned f; };
@@ -362,7 +354,7 @@ try {
parser.set_database(db);
parser.set_stemmer(stemmer);
- parser.set_stemming_strategy(Xapian::QueryParser::STEM_SOME);
+ parser.set_stemming_strategy(Xapian::QueryParser::STEM_ALL);
parser.set_stopper(&mystopper);
Xapian::Query query = parser.parse_query(argv[optind], flags);
$ examples/quest --stemmer fr 'le camion'
Parsed Query: Query(camion@2)
No database specified so not running the query.
$ examples/quest --stemmer fr 'lea seydoux'
Parsed Query: Query((le@1 OR seydoux@2))
No database specified so not running the query.
}}}
So `le` is handled as a stopword, while `lea` is stemmed to `le` and
included in the search.

There's one slight wrinkle: where search-time stopwording is suppressed
(e.g. because `le` is used in a phrase, or an individual `le` is quoted,
or the query is composed entirely of stopwords), `le` in the query won't
be stopped and will match `lea` in the document, e.g.:
{{{
$ examples/quest --stemmer fr '"le voiture"'
Parsed Query: Query((le@1 PHRASE 2 voitur@2))
No database specified so not running the query.
$ examples/quest --stemmer fr '"le" voiture'
Parsed Query: Query((le@1 OR voitur@2))
No database specified so not running the query.
$ examples/quest --stemmer fr 'le la'
Parsed Query: Query((le@1 OR la@2))
No database specified so not running the query.
}}}
I think to handle such cases, we'd probably need to explicitly teach
`QueryParser` about the different stop strategies. If we're removing
stopwords at index time, then in cases like the above it could do
something more appropriate.
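
To make the cases above concrete, here is a minimal toy model in plain C++. It is not Xapian code and does not reproduce how `QueryParser` is implemented: `toy_stem` and `toy_query_terms` are hypothetical names, the "stemmer" only knows the two mappings visible in the output above, and the only suppression rules modelled are "quoted words are never stopped" and "a query made up entirely of stopwords keeps them".

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the French stemmer: it only knows the two
// mappings seen in the quest output ("lea" -> "le", "voiture" -> "voitur")
// and returns every other word unchanged.  This is NOT Snowball.
std::string toy_stem(const std::string& word) {
    if (word == "lea") return "le";
    if (word == "voiture") return "voitur";
    return word;
}

// Toy query-side pipeline: the stop check is made on the unstemmed word,
// quoted words are exempt, and if every word was stopped the stopped words
// are used after all.
std::vector<std::string> toy_query_terms(const std::string& query,
                                         const std::set<std::string>& stopwords) {
    std::vector<std::string> kept, all;
    std::string word;
    bool in_quotes = false, word_quoted = false;

    auto flush = [&] {
        if (word.empty()) return;
        const std::string term = toy_stem(word);
        all.push_back(term);
        if (word_quoted || stopwords.count(word) == 0) kept.push_back(term);
        word.clear();
        word_quoted = false;
    };

    for (char c : query) {
        if (c == '"') {
            flush();
            in_quotes = !in_quotes;
        } else if (c == ' ') {
            flush();
        } else {
            word += c;
            word_quoted = in_quotes;
        }
    }
    flush();
    return kept.empty() ? all : kept;
}

// Checks the toy model against the terms shown in the quest output.
void check_ticket_examples() {
    const std::set<std::string> stop = {"la", "le"};
    using Terms = std::vector<std::string>;
    assert((toy_query_terms("le camion", stop) == Terms{"camion"}));
    assert((toy_query_terms("lea seydoux", stop) == Terms{"le", "seydoux"}));
    assert((toy_query_terms("\"le voiture\"", stop) == Terms{"le", "voitur"}));
    assert((toy_query_terms("\"le\" voiture", stop) == Terms{"le", "voitur"}));
    assert((toy_query_terms("le la", stop) == Terms{"le", "la"}));
}
```

In the last three cases the term `le` reaches the query even though it is a stopword, and since a document's `lea` was indexed as `le`, it matches.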

You could switch to `STEM_SOME` - that's the default mode of operation,
and allows for exact matching of words and exact phrase searches, which
you can't achieve if you only index the stemmed forms. The downside is
that the database will be larger.
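
As a rough illustration of that trade-off, the toy model below compares indexing only stemmed forms with indexing both forms, as `TermGenerator` does under `STEM_SOME` (unstemmed terms plus stems carrying a `Z` prefix). It is plain C++, not Xapian code: `ToyStrategy`, `toy_index` and `toy_stem` are hypothetical names, and positions and stopwords are ignored.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Hypothetical stand-in for the French stemmer, limited to the mapping
// seen in this ticket ("lea" -> "le"); other words pass through unchanged.
std::string toy_stem(const std::string& word) {
    return word == "lea" ? "le" : word;
}

enum class ToyStrategy {
    StemmedOnly,  // index just the stem of each word
    Both          // index the exact word plus a "Z"-prefixed stem
};

// Returns the set of terms a document would be indexed under in this model.
std::set<std::string> toy_index(const std::string& text, ToyStrategy strategy) {
    std::set<std::string> terms;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        if (strategy == ToyStrategy::Both) {
            terms.insert(word);                  // exact form, for exact matching
            terms.insert("Z" + toy_stem(word));  // stemmed form, kept separate
        } else {
            terms.insert(toy_stem(word));
        }
    }
    return terms;
}

// Shows why an exact `le` no longer collides with `lea` when both forms
// are indexed, at the cost of more terms per document.
void check_index_example() {
    const auto stemmed_only = toy_index("lea seydoux", ToyStrategy::StemmedOnly);
    const auto both = toy_index("lea seydoux", ToyStrategy::Both);
    assert(stemmed_only.count("le") == 1);  // exact "le" would match "lea"
    assert(both.count("le") == 0);          // exact "le" no longer matches
    assert(both.count("lea") == 1 && both.count("Zle") == 1);
    assert(both.size() > stemmed_only.size());
}
```

With both forms indexed, a search for the exact word `le` has nothing to match in a document that only contains `lea`, while a stemmed search can still find it through `Zle`.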
--
Ticket URL: <https://trac.xapian.org/ticket/750#comment:8>
Xapian <https://xapian.org/>