[Xapian-devel] adaptive query scoring

Tue May 16 17:29:12 BST 2006

>> Is there a way to do adaptive query scoring (as in popular results
>> returned by a query should get more weight because they are getting
>> clicked more often) in xapian?  Is this what the rset class should be
>> used for?
>>     
>
> You could use the RSet to achieve something like this by recording
> which documents users like for which queries and setting an RSet from
> that when there's a query for the same terms.  It would probably
> make sense to use a second Xapian database to store the queries matching
> each document click so you'd run a search on that to find what to set
> the RSet as on the main database.
>   
Which approach do you think would be easier and, more importantly, add
the least overhead?  It seems to me that adding adaptive terms (or
whatever a good name for these would be!), rewriting the queries, and
working on a single Xapian db would mean less overhead (and less
maintenance). What do you think?  Would the RSet approach be as
versatile, i.e. could it use the adjacent-word approach you suggest
below?
>   
>> I could write a php app to do adaptive results scoring for separate
>> words (just recording the clicks and then have a cron:ned script add
>> weight to the document_id:s for the recorded words)
>>     
>
> That would be another way - you could add a prefixed term (e.g.
> XCLICKfoo) to those documents which the user selected when they
> had searched for "foo".  Then turn a search term "foo" into
> (foo ANDMAYBE XCLICKfoo) (must match foo, if XCLICKfoo also matches
> add the weight from that.)
>   
Yep, this sounds workable.
Does the ANDMAYBE operator add much overhead to queries?  Would it be
faster to just use the OR operator?  If a result matches the XCLICK*
term, it _must_ also match the original term.
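
To make the OR question concrete, here is a toy model of the match sets
only. It is not Xapian code and says nothing about speed; the function
names and document ids are made up for illustration. It just shows that
the two operators return the same documents as long as every document
carrying XCLICKfoo also carries foo, and where they diverge if that
ever stops being true (say, after a product feed is re-indexed).

```python
def or_match(left, right):
    """Documents matched by (left OR right): either term is enough."""
    return left | right


def and_maybe_match(left, right):
    """Documents matched by (left ANDMAYBE right): left is required;
    right only contributes weight, so it never widens the match set."""
    return set(left)


# Hypothetical posting lists (document ids) for the two terms.
foo_docs = {1, 2, 3}
click_docs = {2}  # every clicked document also contains "foo"

# With the invariant intact, both operators match the same documents.
same = or_match(foo_docs, click_docs) == and_maybe_match(foo_docs, click_docs)

# If a stale XCLICKfoo term survives on a document that no longer
# contains "foo" (id 9 here), OR lets it through and ANDMAYBE does not.
stale_click_docs = {2, 9}
leaked = or_match(foo_docs, stale_click_docs) - and_maybe_match(foo_docs, stale_click_docs)
```

So OR is only a safe substitute if the click terms are cleaned up
whenever the document text changes.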

> I'm not totally sure that matters - for the example you give, there's
> going to be a very strong correlation.
Not a very good example, agreed :)
>   There certainly are words which
> have many meanings where there's less correlation (e.g. 'stock market'
> vs 'vegetable stock') and even word order can make a big difference
> (e.g. 'oil bath' vs 'bath oil').  But for the 'stock' example, a query
> for just 'stock' could usefully promote results from both, and a query
> for 'stock market' would have 'market' in too, so although the cookery
> pages would get a boost, the financial pages would get a larger one.
>   
Yeah, there must be tons of word pairs out there that would benefit from
some sort of 'mutual' scheme, but there are probably a great many that
would suffer from it too, especially in our data set.
> In fact, I suspect you would improve retrieval overall simply by
> favouring pages which somebody has clicked on for some query (especially
> for a search over random web sites - the web is full of useless junk
> which nobody will ever want in their results).  That approach is
> particularly susceptible to "clickbot" abuse though.
>   
I have a pretty special set of data to search on. I am building a search
app for a large shopping portal, and the data I search through comes
from merchants' product feeds. Since our users are logged in most of the
time when they use our site, I can mitigate clickbotting quite well
by only letting each user cast one 'vote' per word/phrase per day, or
maybe per word/phrase ever. That sacrifices some 'input' from
non-logged-in users, but at least makes clickbotting difficult. I might
do it based on IP for non-logged-in users. That sacrifices some NAT
users, but you can't win them all :p
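
A minimal sketch of that vote limit, assuming an in-memory set stands in
for whatever table the real app would use. The function name, the
"user:" / "ip:" voter ids and the date strings are all hypothetical.

```python
def cast_vote(seen, voter, phrase, day=None):
    """Record one vote per (voter, phrase, day).

    Returns True if the vote counts, False if it is a repeat.
    Pass day=None to allow only one vote per voter and phrase ever.
    """
    # Normalise case and whitespace so "Xbox  Console" and
    # "xbox console" count as the same phrase (an assumption).
    key = (voter, " ".join(phrase.lower().split()), day)
    if key in seen:
        return False
    seen.add(key)
    return True


seen = set()
first = cast_vote(seen, "user:42", "Xbox  Console", "2006-05-16")
repeat = cast_vote(seen, "user:42", "xbox console", "2006-05-16")
next_day = cast_vote(seen, "user:42", "xbox console", "2006-05-17")
# Not logged in: fall back to the IP address as the voter id.
by_ip = cast_vote(seen, "ip:192.0.2.7", "xbox console", "2006-05-16")
```

Only the votes that return True would be passed on to the script that
adds the click terms.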

I can see that your theory here of favoring all results that get
clicked, regardless of the query, can work for a regular web search, but
I don't think it will pan out as well for us, since we have specific
products as results.
> But anyway, if you want to work with phrases, the hard part is to decide
> what's a phrase.  Then just generate a term for the phrase e.g.
> "XCLICKxbox console".  If you're going to treat the whole query as a
> phrase, I'd suggest you try generating terms from adjacent word pairs
> (so 'natural history museum' gives "XCLICKnatural history" and
> "XCLICKhistory museum").
>   
Sounds like a pretty good idea to me. Get a 'mutual' effect, but one
that is limited enough to hopefully keep the adaptive-terms accurate for
the product itself. I shall give it a good think and maybe write an
experimental implementation here later today.
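
As a first rough sketch of the term generation (in Python rather than
PHP, just to keep it short): split the query into words and emit one
prefixed term per adjacent pair. The function name is made up, and the
lower-casing and the single-word fallback are my own assumptions, not
something suggested in the thread.

```python
def click_terms(query, prefix="XCLICK"):
    """Build click terms from a query string.

    A one-word query gives a single term (XCLICKfoo); a longer query
    gives one term per pair of adjacent words.
    """
    words = query.lower().split()
    if len(words) < 2:
        return [prefix + w for w in words]
    # zip(words, words[1:]) pairs each word with the one following it.
    return [f"{prefix}{a} {b}" for a, b in zip(words, words[1:])]


museum_terms = click_terms("natural history museum")
single_term = click_terms("foo")
```

The same function would be used in both places: at click time to decide
which terms to add to the clicked document, and at search time to
decide which terms to attach to the rewritten query.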
> I'd love to hear how you get on.
>   
Absolutely, I love to get feedback from the creator of the actual search
engine here :)
I'd be happy to contribute code back to the Xapian project if you think
there is any use for it. I can only offer PHP code, but for example I
have two classes, one for indexing and one for searching, which may be
suitable as PHP examples for other PHP users to look at. They use many
more of the Xapian features than the present examples do.

Regards
Alec