[Xapian-devel] Custom weight factors - pushing the relevancy ranking how we want it

Fri Dec 17 10:28:42 GMT 2004

On Fri, Dec 17, 2004 at 11:06:41AM +0100, Michiel Roding wrote:

>    As forums are, the content that is relevant to a search is not just
>    determined by the frequency or location of the terms; the date the topic
>    has been last modified is important as well.
>    Another issue we find is that the number of results is so overwhelming,
>    the user is unable to find the correct topic for his needs. Combining this
>    with some statistics, we found that a very large part of the queries to
>    Omega are the same. Keywords like windows, xp, dvd etc. are very popular.
>    Therefore, we are contemplating building a "does this topic meet your
>    search?" feature to store which topics are most relevant to the queries as
>    defined by the users.
>    Other features could be a lame attempt at the PageRank relevancy, storing
>    if a user almost immediately skips a topic (irrelevant) etc.
> 
>    But, this needs to be stored (easy) and processed by Xapian in the
>    sorting.
> 
>    How could we go about this? Does Xapian somehow support these custom
>    weight factors?

There's currently no way of using document values (pieces of
information stored about the document) in the mix to calculate weights
- you can write your own Weight scheme, but without access to the
docid you can't look up things like this. I don't know enough about
the internals of the matcher to say how much of a performance hit
adding this kind of support would be.

Two things occur to me. Firstly, you could have a special term which
you mix in to your probabilistic term list which means "this is a good
topic". So if Xgoodtopic appears once in a document, it means one user
liked it. (You could add it automatically if the user doesn't skip, or
you could make it explicit.) You might then have some luck playing with
the wdf (within-document frequency) of that term to boost some
documents. The problem is that you'll
end up with the corpus frequency of that term being very high, which
will downplay the effect of it on document rank. I suppose you could
have XT<topic> for every single-term search, so XTwindows, XTdvd and
so on. That would keep the corpus frequency down to a more manageable
level, perhaps.
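
To make the corpus-frequency point concrete, here is a rough sketch in
plain Python. It does not use Xapian at all: the idf() formula is a
generic smoothed inverse document frequency, not Xapian's actual
weighting, and the corpus sizes, the boost_term() helper and the
lowercasing are all assumptions made up for illustration.

```python
import math

def idf(total_docs, docs_with_term):
    """Smoothed inverse document frequency.

    Illustrative only: a generic formula, not Xapian's weighting scheme.
    """
    return math.log(1.0 + (total_docs - docs_with_term + 0.5)
                    / (docs_with_term + 0.5))

def boost_term(query):
    """Return a per-query boost term such as 'XTwindows'.

    Only single-term queries get a boost term; anything else returns None.
    The XT prefix follows the suggestion above; lowercasing is an assumption.
    """
    words = query.split()
    if len(words) != 1:
        return None
    return "XT" + words[0].lower()

# Hypothetical numbers, chosen only to show the effect.
TOTAL_DOCS = 100_000      # documents in the index
GENERIC_DOCS = 60_000     # documents carrying one shared Xgoodtopic term
PER_QUERY_DOCS = 500      # documents carrying XTwindows

generic_weight = idf(TOTAL_DOCS, GENERIC_DOCS)
per_query_weight = idf(TOTAL_DOCS, PER_QUERY_DOCS)

print("Xgoodtopic weight factor:", round(generic_weight, 3))
print(boost_term("Windows"), "weight factor:", round(per_query_weight, 3))
```

With these made-up numbers the per-query term gets a much larger weight
factor than the shared one, which is the reason for splitting the boost
term up by query.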

The other thing is that you may have luck with trying to
automatically segment your top results. Say you grab the first 20, you
could then see how similar these results are. One way of doing this
that might work (but Olly or Richard will be able to give you a better
answer :-) would be to get the ESet for the query with the RSet as
each document in the MSet in turn, throwing the terms from the ESet
back into the query and seeing which other documents from the original
MSet come out of that new query. That should enable you to group
related results to some extent, although how well it works will depend
on how your topics are structured.
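
The grouping idea can be sketched without Xapian, as a simulation in
plain Python. Everything here is a stand-in: expansion_terms() is a
crude substitute for an ESet built from a one-document RSet, the overlap
check replaces re-running the expanded query, and the documents, the k
and min_overlap parameters are invented for illustration. With Xapian
itself you would presumably use Enquire's get_eset() with an RSet
instead.

```python
import math
from collections import Counter

def expansion_terms(doc_terms, all_docs, k=3):
    """Pick the k terms of one document that are most distinctive within
    the result set (a crude stand-in for an ESet from a one-document RSet)."""
    n = len(all_docs)
    # Document frequency of each term within the result set only.
    df = Counter(t for terms in all_docs.values() for t in set(terms))
    tf = Counter(doc_terms)
    scored = {t: tf[t] * math.log(1.0 + n / df[t]) for t in tf}
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scored, key=lambda t: (-scored[t], t))
    return ranked[:k]

def related_docs(docid, all_docs, k=3, min_overlap=2):
    """Return the other documents in the result set that contain at least
    min_overlap of this document's expansion terms."""
    terms = set(expansion_terms(all_docs[docid], all_docs, k))
    return sorted(
        other for other, other_terms in all_docs.items()
        if other != docid and len(terms & set(other_terms)) >= min_overlap
    )

# Hypothetical top results for the query "windows" (docid -> terms).
top_results = {
    1: ["windows", "xp", "install", "driver", "driver"],
    2: ["windows", "xp", "driver", "crash"],
    3: ["windows", "dvd", "burn", "burn", "disc"],
    4: ["windows", "dvd", "burn", "region"],
}

groups = {docid: related_docs(docid, top_results) for docid in top_results}
for docid, related in groups.items():
    print(docid, "is related to", related)
```

Note that the query term itself ("windows") appears in every result, so
it scores lowest and never makes it into the expansion terms; the
grouping is driven by what distinguishes the results from each other.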

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org