[Xapian-devel] Project: QueryParser Reimplementation, to Olly Betts and Dan Colish

Tue Apr 3 06:50:27 BST 2012

There's no reason to Cc me - I read the list.

On Sun, Apr 01, 2012 at 03:24:09PM -0400, Weixian Zhou wrote:
> *The following is my general idea for the project. For a complete query
> parser I still need to consider more details. Please give me feedback
> because the description of this project is lack of detailed information,
> and I can submit my proposal without giant deviation.*
> *
> *
> design principle of query parsing:
> 1) better understanding user input. All search engine do is understanding
> what human is trying to find and return the best matching results.

How are you going to "understand" a human?

> 2) balance between query time and precision. Users? time is more valuable,
> they can?t wait 5 more minutes for a better result. So how to achieve the
> best result within certain amount of time is important.

While the time to execute the query is clearly important, the time to
parse it seems really unlikely to be much of an issue, unless we're
doing something badly wrong.

> Here are my initial ideas of free text query parsing.
> 1) deal with mis-spelling. User maybe input mis-spelling words, thus our
> search engine should support tolerant retrieval. Search engine infers the
> correctness of a word based on the whether the word appears in dictionary
> and the number of returned documents. If the number of returned documents
> less than a preset value, it is more likely to be a mis-spelling word.
> Thus, we search for the potential correctly spelling  words. We can first
> employ k-gram indexes to further limit the set of vocabulary terms, then
> compute the edit distance of strings.

Xapian already supports spelling correction.

> 2) deal with synonyms. This can be done by index unnormalized tokens and
> maintain a query expansion list of multiple vocabulary entries to consider
> for a certain query term. A query term is then effectively a disjunction of
> several postings list. Also, an alternative is to do the expansion during
> index construction.

Xapian already supports synonyms.

> 3) deal with stop words. In modern search engines this is not a good idea,
> so I think we can just discard it.

Xapian already supports stopwords.

> 4) dividing the queries. User input might be long sentence and contains
> ?stop words?. Search engine achieves better return result by doing multiple
> queries based on the dividing of the original query string. How to divide
> the query is the most difficult. Here?s some idea.

What's your evidence for thinking that dividing the query would give
better results?

If you run multiple queries, you get multiple sets of results.  What
do you then do with those?

>  a. documents contain more consecutive matching words rank higher.
> b. different words have different weight (e.g., tf-idf), documents with
> bigger weight rank higher.

I'm not sure I really follow you on point (b) - it's sounds like you're
just describing ranking documents by weight, which we already do.

>  c. use k-gram to divide long query and limit the query times.

You seem to be assuming a simple connection between query size and time
to execute, but the reality is rather more complicated.  For example, a
two term query can be faster than a single term query (because the
optimiser can potentially terminate early with two terms).

Cheers,
    Olly