[Xapian-discuss] How to beat Google aka Xapian & Natural Language Processing.

James Aylett james-xapian at tartarus.org
Tue Oct 2 12:11:01 BST 2007


On Mon, Oct 01, 2007 at 02:01:50PM -0700, Kevin Duraj wrote:

> Search query example: What is Kevin Duraj doing?
> OP_NLP would analyze the sentence as follows:
> [what=pronoun, question|is=verb|kevin=noun|duraj=noun|doing=verb|?=punctuation]

'What' isn't a pronoun, but never mind. You're suggesting a fairly
primitive level of NLP - is there any evidence to suggest this will
give good results? For instance, there's no way you could use that
strategy to deal with referents. Also, how are you planning on coping
with ambiguity in part-of-speech? ('dove' is both a noun and a verb.)

Couldn't you do this separately to Xapian by judicious fiddling with
the generated query? Get the raw unstemmed terms (aka 'words' in this
context) and figure out how you want to treat them, and construct a
new query which reflects the weighting you want to apply. (Bear in
mind that BM25 takes into account the within-query-frequency of a
term as well as the within-document-frequency, and the defaults
include this.)
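
To make that concrete, here is a rough sketch of the kind of
pre-processing I mean, done entirely outside Xapian. It assumes you
already have a part-of-speech tagger producing (word, tag) pairs; the
POS_WQF table, its values and the weighted_terms() helper are made up
for illustration, and it does nothing clever about ambiguous tags.

```python
# Hypothetical mapping from part of speech to a within-query-frequency
# (wqf) boost. The values are illustrative only, not tuned.
POS_WQF = {"noun": 2, "verb": 1}


def weighted_terms(tagged_words):
    """Turn (word, part_of_speech) pairs into a {term: wqf} mapping.

    Words whose tag is not in POS_WQF (pronouns, punctuation, ...) are
    dropped. A word seen more than once accumulates its wqf.
    """
    terms = {}
    for word, pos in tagged_words:
        wqf = POS_WQF.get(pos, 0)
        if wqf:
            term = word.lower()
            terms[term] = terms.get(term, 0) + wqf
    return terms


# Tags as given in the quoted example; a real tagger would supply these.
tagged = [
    ("What", "pronoun"), ("is", "verb"), ("Kevin", "noun"),
    ("Duraj", "noun"), ("doing", "verb"), ("?", "punctuation"),
]
query_terms = weighted_terms(tagged)
print(query_terms)  # {'is': 1, 'kevin': 2, 'duraj': 2, 'doing': 1}
```

Each (term, wqf) pair could then be turned into a Xapian query term
with that wqf and the terms combined with OP_OR. I've left that step
out so the sketch runs without the bindings installed. You'd probably
also want to stem the terms the same way as at index time.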

I have no idea how this applies to other languages. (Well, I do, but
only for Latin, the Romance languages and to an extent the Germanic ones. That's
not all that useful on the web.)

> PS: Can you see the future?

I think it's orange ;-)

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


