[Xapian-discuss] word pair indexing and querying

Mark Hagger mark.hagger at m-spatial.com
Thu Sep 21 22:44:27 BST 2006


On Thursday 21 September 2006 18:11, you wrote:
> The document length is taken into account to some extent, but what you
> appear to want is for a document containing terms which are _not_ in the
> query to be heavily penalised.

I'm not entirely sure I mean that.  For example, whilst I'm happy for a search 
for, say, "bristol job centre" to return a mid-score relevance, perhaps 60% 
or so, I'm not happy with just plain "job" or just plain "centre" 
returning a good score in this case.  Furthermore, I'm happy for the query 
"bristol job centre stuff" to also score quite well, although lower than the 
60% from before.

> I have to ask though: what is wrong with a search for "centre" returning a
> document about "job centre"?  Would you object to a search for "job"
> returning the document?  Do you think your users would really be confused
> by a search for "centre" returning  a document about "job centre"?  If you
> can explain why this is a problem, we'll be more able to do something about
> it.

I'll try to explain where I'm going with this.  The crucial thing here is 
that I need to be able to answer, fairly aggressively, "no, I don't appear to 
have any sensible matches for your query".  So whilst I would agree that 
normally with a search you'd be quite happy for a "potential" match to be 
thrown up, i.e. "centre" against "job centre", in my case that's a killer: I 
really must try very hard to avoid spurious matches.  Again, recall my earlier 
example about "wi fi hot spots": I'd be failing in this way if I returned a 
match for the query "hot" against my wifi hotspots record.

As the dataset is quite small, it's quite practical to assign, almost by hand, 
keywords/phrases to each record, at least for the more "common" records.
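
To make the behaviour I'm after concrete, here is a minimal sketch in plain 
Python.  It does not use Xapian at all, and the function name, the scoring 
rule and the sample phrase are purely illustrative assumptions, not anything 
I have implemented.  A hand-assigned phrase only counts if every one of its 
words appears in the query, and the score is the fraction of the query's 
words that the phrase accounts for.

```python
def phrase_score(query, phrases):
    """Score a query against a record's hand-assigned phrases (illustrative only).

    A phrase counts only if ALL of its words appear in the query, so "job"
    alone never matches the phrase "job centre".  The score is the fraction
    of query words covered by the best fully matched phrase.
    """
    query_words = query.lower().split()
    if not query_words:
        return 0.0
    query_set = set(query_words)
    best = 0.0
    for phrase in phrases:
        phrase_words = set(phrase.lower().split())
        # Reject the phrase unless the query contains every word in it.
        if phrase_words and phrase_words <= query_set:
            covered = sum(1 for w in query_words if w in phrase_words)
            best = max(best, covered / len(query_words))
    return best


# Hypothetical record with one hand-assigned phrase.
record_phrases = ["job centre"]
for q in ["bristol job centre", "bristol job centre stuff", "job", "centre"]:
    print(q, "->", round(phrase_score(q, record_phrases), 2))
```

With this rule "bristol job centre" scores about two thirds, adding "stuff" 
lowers the score to a half, and plain "job" or plain "centre" scores nothing, 
so I can answer "no sensible matches" for them.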

In fact the real problem is even harder than that.  For various reasons it's 
convenient not only to match the query against one database, but to try to 
match it simultaneously against multiple disparate databases.  In truth 
this is almost attempting, albeit in a simplistic way, to achieve some sort 
of natural language parsing of the query.

Mark



