[Xapian-tickets] [Xapian] #360: SynonymPostList always requires doclength if wdf is used
Xapian
nobody at xapian.org
Tue Dec 27 10:21:09 GMT 2011
#360: SynonymPostList always requires doclength if wdf is used
---------------------+------------------------------------------------------
Reporter: richard | Owner: olly
Type: defect | Status: new
Priority: normal | Milestone: 1.3.x
Component: Matcher | Version: SVN trunk
Severity: minor | Keywords:
Blockedby: | Platform: All
Blocking: |
---------------------+------------------------------------------------------
Changes (by olly):
* milestone: 1.2.x => 1.3.x
Old description:
> SynonymPostList (in the opsynonym branch), currently clamps computed wdf
> values to the document length. This is to ensure that the wdf does not
> exceed the document length, which is a condition that some weight schemes
> can rely on for computing tight bounds on the maximum weight.
>
> It would be good to avoid having to calculate the doclength for weighting
> schemes which don't require the doclength, but do require the wdf. One
> approach for this would be to ensure that the wdf sum used in op synonym
> only counts each physical term once; though it is hard to do this
> duplicate removal in advance because query tree decay may remove some
> instances of a term being used while leaving others.
New description:
!SynonymPostList currently clamps computed wdf values to the document
length. This is to ensure that the wdf does not exceed the document
length, which is a condition that some weight schemes can rely on for
computing tight bounds on the maximum weight.

It would be good to avoid having to calculate the doclength for weighting
schemes which don't require the doclength, but do require the wdf. One
approach for this would be to ensure that the wdf sum used in op synonym
only counts each physical term once; though it is hard to do this
duplicate removal in advance because query tree decay may remove some
instances of a term being used while leaving others.
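
The clamping described above can be sketched roughly as follows. This is only an illustration, not the actual !SynonymPostList code: the helper name {{{synonym_wdf}}} and its signature are hypothetical, and it assumes the synonym's wdf is simply the sum of the subqueries' wdf values.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper, not Xapian's real implementation: combine the wdf
// values reported by the subqueries of a synonym, then clamp the result to
// the document length so that wdf <= doclength holds for the weight scheme.
std::uint32_t synonym_wdf(const std::vector<std::uint32_t>& sub_wdfs,
                          std::uint32_t doclength)
{
    // Sum in 64 bits so the addition itself cannot overflow.
    std::uint64_t sum = std::accumulate(sub_wdfs.begin(), sub_wdfs.end(),
                                        std::uint64_t{0});
    // The clamp is the step which forces doclength to be fetched.
    return static_cast<std::uint32_t>(
        std::min<std::uint64_t>(sum, doclength));
}
```

If the same physical term appears twice under the synonym with a wdf of 6 in a document of length 10, the raw sum is 12 and only the clamp brings it back to 10. Counting each physical term once would make the clamp unnecessary in that case.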
--
Comment:
I think this is probably difficult to fix as stated, and the contortions
which would be needed are probably not worth the effort.

But we could add an {{{OP_MAX}}} operator which acts like {{{OP_OR}}} but
returns the greatest weight of any subquery instead of summing them. This
would act in a fairly similar way to {{{OP_SYNONYM}}}, but wouldn't suffer
from the issue here.

I suggested {{{OP_MAX}}} previously without thinking about this issue, and
we concluded it was probably useful to have.
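
To make the difference between summing and taking the greatest weight concrete, here is a toy model. It is not matcher code: the function names are made up for illustration, and each vector element stands for the weight one subquery assigns to a document (0.0 where that subquery doesn't match).

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Toy model of OP_OR-style combination: the subquery weights are summed.
double or_weight(const std::vector<double>& sub_weights)
{
    return std::accumulate(sub_weights.begin(), sub_weights.end(), 0.0);
}

// Toy model of the proposed OP_MAX-style combination: only the greatest
// subquery weight is returned. Each subquery is weighted on its own, so no
// combined wdf (and hence no clamp against doclength) is involved.
double max_weight(const std::vector<double>& sub_weights)
{
    if (sub_weights.empty()) return 0.0;
    return *std::max_element(sub_weights.begin(), sub_weights.end());
}
```

For subquery weights of 1.5 and 2.0, the OR-style result is 3.5 while the MAX-style result is 2.0.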
--
Ticket URL: <http://trac.xapian.org/ticket/360#comment:2>
Xapian <http://xapian.org/>