[Xapian-tickets] [Xapian] #216: Inconsistent return values for percentage weights
Xapian
nobody at xapian.org
Wed Jul 16 02:45:53 BST 2008
#216: Inconsistent return values for percentage weights
---------------------+------------------------------------------------------
Reporter: richard | Owner: olly
Type: defect | Status: assigned
Priority: normal | Milestone: 1.0.8
Component: Matcher | Version: SVN HEAD
Severity: normal | Resolution:
Keywords: | Blockedby:
Platform: All | Blocking:
---------------------+------------------------------------------------------
Changes (by olly):
* status: new => assigned
Old description:
> When results are being sorted by a value, the percentage values for the
> results
> returned are normalised based on the document in the portion of the mset
> requested which has the highest weight, instead of the document matching
> the
> query which has the highest weight. I have a testcase demonstrating this
> which
> I will attach shortly.
>
> This is because, in multimatch.cc, we calculate "best" by looking for the
> highest weighted document in the candidate mset, but when sorting by
> anything
> other than relevance, the highest weighted document may have been
> discarded already.
>
> It is hard to see how to fix this - one obvious approach would be to
> check every
> candidate document's weight before discarding it during the match
> process, and
> keep track the docid of the document with the highest weight seen so far.
> However, we currently don't calculate the weight for all the documents we
> see
> (because we first check the document against the lowest document in the
> mset
> using mcmp), so this would force us to calculate the weights on documents
> we
> wouldn't otherwise need to calculate it for. Since the percentages
> aren't
> necessarily even wanted, this seems a shame.
>
> Perhaps a reasonable approach would be to add a flag on enquire which
> governed
> whether percentages were wanted or not; it would then be more reasonable
> to go
> to extra effort to keep track of the highest weighted document if the
> percentages were actually desired.
New description:
When results are being sorted primarily by an order other than relevance
(e.g. {{{sort_by_value()}}}), the percentage values returned by the MSet
object may be incorrect. This is because they are calculated from the
highest-weighted document in the requested portion of the MSet, instead of
the highest-weighted document matching the query.
This issue has existed in all previous Xapian releases, as far as we can
tell.
There is currently no fix in progress, since it is probably not possible
to fix without significant loss of efficiency, which would
adversely affect users who aren't interested in the percentage scores.
If you really need percentage scores in this situation, one workaround
would be to first run the search using relevance order, asking for only
the top document, and to remember the weight and percentage assigned to
that document. Then, re-run the search in sorted order, and calculate the
percentages yourself from the weights assigned to the results, using this
information.
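As an illustration of the arithmetic in this workaround, here is a minimal sketch. It assumes percentages scale linearly with weight, so that a document's percentage is its weight multiplied by the top document's percentage and divided by the top document's weight. The helper names {{{rescaled_percent}}} and {{{rescale_all}}} are hypothetical and not part of Xapian, and the truncation and clamping are assumptions which may not match Xapian's own rounding exactly.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not part of Xapian): rescale one weight from the
// sorted search into a percentage, using the weight and percentage of the
// top document from a separate relevance-ordered search.
int rescaled_percent(double weight, double top_weight, int top_percent) {
    assert(top_weight > 0.0);  // the top document must have a positive weight
    double pct = weight * top_percent / top_weight;
    // Assumption: truncate towards zero, then clamp to the range 0..100.
    int truncated = static_cast<int>(pct);
    return std::max(0, std::min(100, truncated));
}

// Apply the same rescaling to every weight in the sorted result list.
std::vector<int> rescale_all(const std::vector<double>& weights,
                             double top_weight, int top_percent) {
    std::vector<int> out;
    out.reserve(weights.size());
    for (double w : weights) {
        out.push_back(rescaled_percent(w, top_weight, top_percent));
    }
    return out;
}
```

In a real application, {{{top_weight}}} and {{{top_percent}}} would presumably come from {{{get_weight()}}} and {{{get_percent()}}} on the first item of a relevance-ordered {{{get_mset(0, 1)}}}, and each {{{weight}}} from {{{get_weight()}}} on the items of the sorted MSet.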
A testcase demonstrating this is attached to this ticket.
The issue is that in multimatch.cc, we calculate "best" by looking for the
highest weighted document in the candidate mset, but when sorting by anything
other than relevance, the highest weighted document may have been discarded
already.
It is hard to see how to fix this. One obvious approach would be to check
every candidate document's weight before discarding it during the match
process, and keep track of the docid of the document with the highest weight
seen so far. However, we currently don't calculate the weight for all the
documents we see (because we first check the document against the lowest
document in the mset using mcmp), so this would force us to calculate weights
for documents which wouldn't otherwise need them. Since the percentages aren't
necessarily even wanted, this seems a shame.
Perhaps a reasonable approach would be to add a flag on enquire which governed
whether percentages were wanted or not; it would then be more reasonable to go
to extra effort to keep track of the highest weighted document if the
percentages were actually desired.
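The problem and the flag-controlled tracking described above can be sketched with a toy model. This is not the code in multimatch.cc: {{{Candidate}}}, {{{ToyResult}}}, {{{toy_match}}} and the {{{want_percentages}}} flag are all hypothetical names invented for illustration. Weights are precomputed here, so the sketch does not model the real cost of calculating a weight for every candidate.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a matching document (all names here are hypothetical).
struct Candidate {
    unsigned docid;
    double sort_key;  // the value the results are sorted by
    double weight;    // the relevance weight
};

struct ToyResult {
    std::vector<Candidate> kept;  // top max_items candidates by sort_key
    double best_in_kept = 0.0;    // highest weight among the kept documents
    double best_overall = 0.0;    // highest weight seen (only if tracked)
    unsigned best_docid = 0;      // docid with that weight (only if tracked)
};

ToyResult toy_match(const std::vector<Candidate>& candidates,
                    std::size_t max_items, bool want_percentages) {
    ToyResult r;
    for (const Candidate& c : candidates) {
        // Extra effort, made only when percentages are wanted: note the
        // weight of every candidate before it can be discarded.
        if (want_percentages && c.weight > r.best_overall) {
            r.best_overall = c.weight;
            r.best_docid = c.docid;
        }
        // Keep only the top max_items candidates, ordered by sort_key.
        r.kept.push_back(c);
        std::stable_sort(r.kept.begin(), r.kept.end(),
                         [](const Candidate& a, const Candidate& b) {
                             return a.sort_key > b.sort_key;
                         });
        if (r.kept.size() > max_items) r.kept.pop_back();
    }
    // Taking "best" from the kept documents only is what goes wrong when the
    // highest weighted document has already been discarded.
    for (const Candidate& c : r.kept) {
        r.best_in_kept = std::max(r.best_in_kept, c.weight);
    }
    // Invariant: the tracked best cannot be lower than the best kept one.
    if (want_percentages) assert(r.best_overall >= r.best_in_kept);
    return r;
}
```

With candidates whose highest weight belongs to a document with a low sort key, {{{best_in_kept}}} and {{{best_overall}}} differ, which is the discrepancy this ticket describes.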
--
Comment:
Merge the ReleaseNotes entry from 1.0.7 into the description to try to
keep information about this issue in one place.
--
Ticket URL: <http://trac.xapian.org/ticket/216#comment:11>
Xapian <http://xapian.org/>