[Xapian-devel] Project: QueryParser Reimplementation, to Olly Betts and Dan Colish

Sun Apr 1 07:05:06 BST 2012

*Hi all,*
*
*
*The following is my general idea for the project. For a complete query
parser I still need to consider more details. Please give me feedback
because the description of this project is lack of detailed information,
and I can submit my proposal without giant deviation.*
*
*
design principle of query parsing:
1) better understanding user input. All search engine do is understanding
what human is trying to find and return the best matching results.
2) balance between query time and precision. Users’ time is more valuable,
they can’t wait 5 more minutes for a better result. So how to achieve the
best result within certain amount of time is important.

Here are my initial ideas of free text query parsing.
1) deal with mis-spelling. User maybe input mis-spelling words, thus our
search engine should support tolerant retrieval. Search engine infers the
correctness of a word based on the whether the word appears in dictionary
and the number of returned documents. If the number of returned documents
less than a preset value, it is more likely to be a mis-spelling word.
Thus, we search for the potential correctly spelling  words. We can first
employ k-gram indexes to further limit the set of vocabulary terms, then
compute the edit distance of strings.
2) deal with synonyms. This can be done by index unnormalized tokens and
maintain a query expansion list of multiple vocabulary entries to consider
for a certain query term. A query term is then effectively a disjunction of
several postings list. Also, an alternative is to do the expansion during
index construction.
3) deal with stop words. In modern search engines this is not a good idea,
so I think we can just discard it.
4) dividing the queries. User input might be long sentence and contains
‘stop words’. Search engine achieves better return result by doing multiple
queries based on the dividing of the original query string. How to divide
the query is the most difficult. Here’s some idea.
 a. documents contain more consecutive matching words rank higher.
b. different words have different weight (e.g., tf-idf), documents with
bigger weight rank higher.
 c. use k-gram to divide long query and limit the query times.

Thank you.

-- 
Weixian Zhou
Department of Computer Science and Engineering
University at Buffalo, SUNY
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120401/32fdeaff/attachment.htm>