[Xapian-devel] Bug and patch for +terms with wildcards

richard at lemurconsulting.com richard at lemurconsulting.com
Wed Dec 6 00:04:24 GMT 2006


In current Xapian SVN HEAD, there is a bug in the query parser concerned
with the handling of wildcard terms with a "+" prefix.  Specifically,
a query such as "+foo* bar" will be parsed by the query parser into
Xapian::Query("bar") if there are no terms in the database which start
"foo".  Instead, since the "+" term cannot be matched, I believe this query
should return no documents.

I've put a patch together to fix this issue, but it requires the
introduction of a new query operator, to mark a query as matching no
documents (as opposed to a query created with the default query
constructor, which represents an undefined query).  I've called this
operator "OP_MATCH_NOTHING", and it takes no subqueries.

I believe this should be public, since it may be useful for people trying
to write their own query parsers, rather than relying on the builtin query
parser.

It's possible that a similar approach would be a neat solution for
representing "alldocument" queries.  Currently, a special query term can be
created which matches all documents by creating a leaf query for which the
term is the null string.  This is a somewhat "magic" and unobvious approach
- instead, we could introduce an "OP_MATCH_ALL" nullary operator, which
would be converted to a postlist which matches all documents.  It's not
clear why an empty term should magically match all documents, rather than
none, or indeed why it should have any special meaning.

The patch includes test cases for the bug, fixes the problem, and I think
is the neatest approach, but since it adds a new, publically visible,
operator to Xapian::Query, I thought I'd better put the patch up for review
on the mailing list rather than commit it directly.

It's attached to this email, so, comments welcome!

-- 
Richard
-------------- next part --------------
Index: queryparser/queryparser.lemony
===================================================================
--- queryparser/queryparser.lemony	(revision 7552)
+++ queryparser/queryparser.lemony	(working copy)
@@ -148,6 +148,9 @@
 	add_to_query(q, Query::OP_OR, Query(*t, 1, pos));
 	++t;
     }
+    if (q.empty()) {
+        return new Query(Query::OP_MATCH_NOTHING);
+    }
     return new Query(q);
 }
 
Index: matcher/localmatch.cc
===================================================================
--- matcher/localmatch.cc	(revision 7552)
+++ matcher/localmatch.cc	(working copy)
@@ -468,6 +468,12 @@
 					postlist_from_query(query->subqs[1], matcher, is_bool),
 					matcher,
 					db->get_doccount());
+	case Xapian::Query::OP_MATCH_NOTHING: {
+            Assert(query->subqs.size() == 0);
+            LeafPostList *pl = new EmptyPostList();
+            pl->set_termweight(new Xapian::BoolWeight());
+            RETURN(pl);
+        }
     }
     Assert(false);
     RETURN(NULL);
Index: tests/queryparsertest.cc
===================================================================
--- tests/queryparsertest.cc	(revision 7552)
+++ tests/queryparsertest.cc	(working copy)
@@ -655,7 +655,7 @@
     qobj = queryparser.parse_query("muscle*", Xapian::QueryParser::FLAG_WILDCARD);
     TEST_EQUAL(qobj.get_description(), "Xapian::Query((muscle:(pos=1) OR musclebound:(pos=1)))");
     qobj = queryparser.parse_query("meat*", Xapian::QueryParser::FLAG_WILDCARD);
-    TEST_EQUAL(qobj.get_description(), "Xapian::Query()");
+    TEST_EQUAL(qobj.get_description(), "Xapian::Query(<nodocuments>)");
     qobj = queryparser.parse_query("musc*", Xapian::QueryParser::FLAG_WILDCARD);
     TEST_EQUAL(qobj.get_description(), "Xapian::Query((muscat:(pos=1) OR muscle:(pos=1) OR musclebound:(pos=1) OR muscular:(pos=1)))");
     qobj = queryparser.parse_query("mutt*", Xapian::QueryParser::FLAG_WILDCARD);
@@ -664,6 +664,29 @@
     // were in the database or not):
     qobj = queryparser.parse_query("mUTTON++");
     TEST_EQUAL(qobj.get_description(), "Xapian::Query(mutton:(pos=1))");
+    // Regression test: check that wildcards work with +terms.
+    unsigned flags = Xapian::QueryParser::FLAG_WILDCARD |
+                     Xapian::QueryParser::FLAG_LOVEHATE;
+    qobj = queryparser.parse_query("+mai* main", flags);
+    TEST_EQUAL(qobj.get_description(), "Xapian::Query((main:(pos=1) AND_MAYBE main:(pos=2)))");
+    // Regression test (if we had a +term which was a wildcard and wasn't
+    // present, the query could still match documents).
+    qobj = queryparser.parse_query("foo* main", flags);
+    TEST_EQUAL(qobj.get_description(), "Xapian::Query(main:(pos=2))");
+    qobj = queryparser.parse_query("+foo* main", flags);
+    TEST_EQUAL(qobj.get_description(), "Xapian::Query(<nodocuments>)");
+    qobj = queryparser.parse_query("foo* +main", flags);
+    TEST_EQUAL(qobj.get_description(), "Xapian::Query(main:(pos=2))");
+    qobj = queryparser.parse_query("+foo* +main", flags);
+    TEST_EQUAL(qobj.get_description(), "Xapian::Query(<nodocuments>)");
+    qobj = queryparser.parse_query("foo* mai", flags);
+    TEST_EQUAL(qobj.get_description(), "Xapian::Query(mai:(pos=2))");
+    qobj = queryparser.parse_query("+foo* mai", flags);
+    TEST_EQUAL(qobj.get_description(), "Xapian::Query(<nodocuments>)");
+    qobj = queryparser.parse_query("foo* +mai", flags);
+    TEST_EQUAL(qobj.get_description(), "Xapian::Query(mai:(pos=2))");
+    qobj = queryparser.parse_query("+foo* +mai", flags);
+    TEST_EQUAL(qobj.get_description(), "Xapian::Query(<nodocuments>)");
     return true;
 }
 
Index: tests/api_anydb.cc
===================================================================
--- tests/api_anydb.cc	(revision 7552)
+++ tests/api_anydb.cc	(working copy)
@@ -208,6 +208,39 @@
     return true;
 }
 
+// tests for the right document count for a wildcard query
+static bool test_wildquery1()
+{
+    Xapian::QueryParser queryparser;
+    unsigned flags = Xapian::QueryParser::FLAG_WILDCARD |
+                     Xapian::QueryParser::FLAG_LOVEHATE;
+    queryparser.set_stemmer(Xapian::Stem("english"));
+    queryparser.set_stemming_strategy(Xapian::QueryParser::STEM_ALL);
+    Xapian::Database db = get_database("apitest_simpledata");
+    queryparser.set_database(db);
+    Xapian::Enquire enquire(db);
+
+    Xapian::Query qobj = queryparser.parse_query("th*", flags);
+    enquire.set_query(qobj);
+    Xapian::MSet mymset = enquire.get_mset(0, 10);
+    // Check that 6 documents were returned.
+    TEST_MSET_SIZE(mymset, 6);
+
+    qobj = queryparser.parse_query("notindb* this", flags);
+    enquire.set_query(qobj);
+    mymset = enquire.get_mset(0, 10);
+    // Check that 6 documents were returned.
+    TEST_MSET_SIZE(mymset, 6);
+
+    qobj = queryparser.parse_query("+notindb* this", flags);
+    enquire.set_query(qobj);
+    mymset = enquire.get_mset(0, 10);
+    // Check that 0 documents were returned.
+    TEST_MSET_SIZE(mymset, 0);
+
+    return true;
+}
+
 // tests a query across multiple databases
 static bool test_multidb1()
 {
@@ -1644,6 +1677,7 @@
     {"simplequery1",       test_simplequery1},
     {"simplequery2",       test_simplequery2},
     {"simplequery3",       test_simplequery3},
+    {"wildquery1",         test_wildquery1},
     {"multidb1",           test_multidb1},
     {"multidb2",           test_multidb2},
     {"multidb3",           test_multidb3},
Index: include/xapian/query.h
===================================================================
--- include/xapian/query.h	(revision 7553)
+++ include/xapian/query.h	(working copy)
@@ -95,7 +95,15 @@
 	    /** Select an elite set from the subqueries, and perform
 	     *  a query with these combined as an OR query.
 	     */
-	    OP_ELITE_SET = 10
+	    OP_ELITE_SET = 10,
+
+            /** Match no documents.
+             *
+             *  This operator takes no subqueries, and matches no documents.
+             *  It can be a useful placeholder when parsing query strings, if
+             *  a portion of the query can be shown to match no documents.
+             */
+            OP_MATCH_NOTHING
 	} op;
 
 	/** Copy constructor. */
@@ -150,6 +158,9 @@
 	/** Apply the specified operator to a single Xapian::Query object. */
 	Query(Query::op op_, Xapian::Query q);
 
+	/** Make a query with a nullary operator (ie, no subqueries). */
+	Query(Query::op op_);
+
 	/** Get the length of the query, used by some ranking formulae.
 	 *  This value is calculated automatically - if you want to override
 	 *  it you can pass a different value to Enquire::set_query().
@@ -285,6 +296,12 @@
 	 */
 	void validate_query() const;
 
+        /** Simplify any matchnothing subqueries, either eliminating them,
+         *  or setting this query to matchnothing, depending on the query
+         *  operator.
+         */
+        void simplify_matchnothing();
+
 	/** Get a string describing the given query type.
 	 */
 	static std::string get_op_name(Xapian::Query::Internal::op_t op);
Index: ChangeLog
===================================================================
--- ChangeLog	(revision 7553)
+++ ChangeLog	(working copy)
@@ -1,3 +1,24 @@
+Tue Dec 05 22:25:28 GMT 2006  Richard Boulton <richard at lemurconsulting.com>
+
+        * queryparser/queryparser.lemony: Fix parsing of queries of the
+          form "+foo* bar", where no terms in the database match the
+          wildcard "foo*", but bar does exist in the database.  Previously,
+          such queries would be equivalent to "bar".  Now, they will match
+          no documents.
+        * include/xapian/query.h,api/omqueryinternal.cc,api/omquery.cc: Add
+          an OM_MATCH_NOTHING query operator, and add a constructor for
+          nullary query operators (ie, queries with no subqueries).  A
+          query with this operator represents a query which matches no
+          documents (such as a wildcard query which expands to no terms).
+          Add support for combining OM_MATCH_NOTHING with other query
+          operators, in the simplify step.  Also, add support for
+          serialising queries with operator OM_MATCH_NOTHING.
+        * matcher/localmatch.cc: Convert an OM_MATCH_NOTHING query into an
+          EmptyPostList() at match time.
+        * tests/api_anydb.cc,tests/queryparsertest.cc: Test parsing of
+          wildcard queries with +terms, and performing a match with a query
+          with operator OM_MATCH_NOTHING resulting from a wildcard query.
+
 Tue Dec 05 21:12:12 GMT 2006  Richard Boulton <richard at lemurconsulting.com>
 
         * include/xapian/query.h,api/omqueryinternal.cc: Fix query
Index: api/omqueryinternal.cc
===================================================================
--- api/omqueryinternal.cc	(revision 7553)
+++ api/omqueryinternal.cc	(working copy)
@@ -54,6 +54,7 @@
 	case Xapian::Query::OP_NEAR:
 	case Xapian::Query::OP_PHRASE:
 	case Xapian::Query::OP_ELITE_SET:
+        case Xapian::Query::OP_MATCH_NOTHING:
 	    return 0;
 	case Xapian::Query::OP_FILTER:
 	case Xapian::Query::OP_AND_MAYBE:
@@ -70,6 +71,7 @@
 {
     switch (op) {
 	case Xapian::Query::Internal::OP_LEAF:
+        case Xapian::Query::OP_MATCH_NOTHING:
 	    return 0;
 	case Xapian::Query::OP_FILTER:
 	case Xapian::Query::OP_AND_MAYBE:
@@ -166,6 +168,9 @@
 	    case Xapian::Query::OP_ELITE_SET:
 		result += "*" + om_tostring(parameter);
 		break;
+            case Xapian::Query::OP_MATCH_NOTHING:
+		result += "!";
+		break;
 	}
     }
     return result;
@@ -189,6 +194,7 @@
 	case Xapian::Query::OP_NEAR:            name = "NEAR"; break;
 	case Xapian::Query::OP_PHRASE:          name = "PHRASE"; break;
 	case Xapian::Query::OP_ELITE_SET:       name = "ELITE_SET"; break;
+	case Xapian::Query::OP_MATCH_NOTHING:   name = "MATCH_NOTHING"; break;
     }
     return name;
 }
@@ -211,6 +217,9 @@
 	if (tname.empty()) return "<alldocuments>" + opstr;
 	return tname + opstr;
     }
+    if (op == Xapian::Query::OP_MATCH_NOTHING) {
+        return "<nodocuments>";
+    }
 
     opstr = " " + get_op_name(op) + " ";
     if (op == Xapian::Query::OP_NEAR ||
@@ -414,6 +423,9 @@
 		    return qint_from_vector(Xapian::Query::OP_ELITE_SET, subqs,
 					    elite_set_size);
 	        }
+	        case '!': {
+		    return new Xapian::Query::Internal(Xapian::Query::OP_MATCH_NOTHING, 0);
+	        }
 	        default:
 		    DEBUGLINE(UNKNOWN, "Can't parse remainder `" << p - 1 << "'");
 		    throw Xapian::InvalidArgumentError("Invalid query string");
@@ -554,6 +566,68 @@
     }
 }
 
+void
+Xapian::Query::Internal::simplify_matchnothing()
+{
+    subquery_list::iterator sq;
+    switch (op) {
+        case OP_PHRASE:
+        case OP_NEAR:
+        case OP_AND:
+        case OP_FILTER:
+            // Doing an "AND" type operation - if we've got any MATCH_NOTHING
+            // nodes, we match nothing.
+            for (sq = subqs.begin(); sq != subqs.end(); sq++) {
+                if ((*sq)->op == OP_MATCH_NOTHING) {
+                    for (sq = subqs.begin(); sq != subqs.end(); sq++)
+                        delete *sq;
+                    subqs.clear();
+                    op = OP_MATCH_NOTHING;
+                    return;
+                }
+            }
+            break;
+        case OP_ELITE_SET:
+        case OP_OR:
+        case OP_XOR:
+            // Doing an "OR" type operation - if we've got any MATCH_NOTHING
+            // subnodes, drop them; except that we mustn't become an empty
+            // node due to this, so we never drop a MATCH_NOTHING subnode
+            // if it's the only subnode.
+            sq = subqs.begin();
+            while (sq != subqs.end() && subqs.size() > 1) {
+                if ((*sq)->op == OP_MATCH_NOTHING) {
+                    delete *sq;
+                    sq = subqs.erase(sq);
+                } else {
+                    ++sq;
+                }
+            }
+            break;
+        case OP_AND_MAYBE:
+            // If left hand side is MATCH_NOTHING, we match nothing.
+            // If right hand side is MATCH_NOTHING, replace node with LHS.
+            // So, if either node is MATCH_NOTHING, replace node with LHS.
+            // Easiest way to do this is to remove the right hand node,
+            // and let simplify_query() perform the replacement of
+            // the unary operator with 
+            Assert(subqs.size() == 2);
+            if (subqs[0]->op == OP_MATCH_NOTHING ||
+                subqs[1]->op == OP_MATCH_NOTHING) {
+                sq = subqs.begin();
+                sq++;
+                delete *sq;
+                subqs.erase(sq);
+            }
+
+            break;
+        case OP_LEAF:
+        case OP_MATCH_NOTHING:
+            // Do nothing.
+            break;
+    }
+}
+
 Xapian::Query::Internal *
 Xapian::Query::Internal::simplify_query()
 {
@@ -588,9 +662,14 @@
 	    break;
     }
 
-    // If we have no subqueries, then we're simply an undefined query.
-    if (subqs.empty()) return 0;
+    // Simplify any "MATCH_NOTHING" nodes.
+    simplify_matchnothing();
 
+    // If we have no subqueries, then we're simply an undefined query,
+    // unless we've got a nullary operator, such as OP_MATCH_NOTHING.
+    if (subqs.empty() && op != OP_MATCH_NOTHING)
+        return 0;
+
     // Any nodes which are valid with only one subquery can be replaced by
     // that solitary subquery.
     if (subqs.size() == 1) {
Index: api/omquery.cc
===================================================================
--- api/omquery.cc	(revision 7552)
+++ api/omquery.cc	(working copy)
@@ -132,6 +132,17 @@
     }
 }
 
+Query::Query(Query::op op_) : internal(0)
+{
+    try {
+	start_construction(op_, 0);
+	end_construction();
+    } catch (...) {
+	abort_construction();
+	throw;
+    }
+}
+
 // Copy constructor
 Query::Query(const Query & copyme)
 	: internal(copyme.internal)


More information about the Xapian-devel mailing list