[Xapian-tickets] [Xapian] #360: SynonymPostList always requires doclength if wdf is used

Xapian nobody at xapian.org
Tue Dec 27 10:21:09 GMT 2011


#360: SynonymPostList always requires doclength if wdf is used
---------------------+------------------------------------------------------
 Reporter:  richard  |       Owner:  olly     
     Type:  defect   |      Status:  new      
 Priority:  normal   |   Milestone:  1.3.x    
Component:  Matcher  |     Version:  SVN trunk
 Severity:  minor    |    Keywords:           
Blockedby:           |    Platform:  All      
 Blocking:           |  
---------------------+------------------------------------------------------
Changes (by olly):

  * milestone:  1.2.x => 1.3.x


Old description:

> SynonymPostList (in the opsynonym branch), currently clamps computed wdf
> values to the document length.  This is to ensure that the wdf does not
> exceed the document length, which is a condition that some weight schemes
> can rely on for computing tight bounds on the maximum weight.
>
> It would be good to avoid having to calculate the doclength for weighting
> schemes which don't require the doclength, but do require the wdf.  One
> approach for this would be to ensure that the wdf sum used in op synonym
> only counts each physical term once; though it is hard to do this
> duplicate removal in advance because query tree decay may remove some
> instances of a term being used while leaving others.

New description:

 !SynonymPostList currently clamps computed wdf values to the document
 length.  This is to ensure that the wdf does not exceed the document
 length, which is a condition that some weight schemes can rely on for
 computing tight bounds on the maximum weight.

 It would be good to avoid having to calculate the doclength for weighting
 schemes which don't require the doclength, but do require the wdf.  One
 approach for this would be to ensure that the wdf sum used in op synonym
 only counts each physical term once; though it is hard to do this
 duplicate removal in advance because query tree decay may remove some
 instances of a term being used while leaving others.
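
 As a rough illustration of the clamping described above, here is a minimal
 self-contained sketch.  It does not use Xapian's real classes: the
 termcount alias and the clamped_synonym_wdf() helper are hypothetical names
 made up for this example, and it assumes the combined wdf is simply the sum
 of the wdf values of the subqueries which matched the document.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a term count type (not Xapian's own typedef).
using termcount = std::uint32_t;

// Sum the wdf of every matching subquery, then clamp the result so that it
// never exceeds the document length.  Note that the clamp is the only step
// which needs the document length at all.
termcount clamped_synonym_wdf(const std::vector<termcount>& sub_wdfs,
                              termcount doclength) {
    // Accumulate in 64 bits so the sum cannot wrap around before clamping.
    std::uint64_t sum = std::accumulate(sub_wdfs.begin(), sub_wdfs.end(),
                                        std::uint64_t{0});
    return static_cast<termcount>(std::min<std::uint64_t>(sum, doclength));
}
```

 For example, clamped_synonym_wdf({3, 3}, 5) yields 5: the same physical
 term counted twice gives a raw sum of 6, which exceeds the document length
 of 5 and so is clamped.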

--

Comment:

 I think this is probably difficult to fix as stated, and the contortions
 which would be needed are probably not worth the effort.

 But we could add an {{{OP_MAX}}} operator which acts like {{{OP_OR}}} but
 returns the greatest weight of any matching subquery instead of the sum of
 their weights.  This would behave in a fairly similar way to
 {{{OP_SYNONYM}}}, but wouldn't suffer from the issue described in this
 ticket.

 I suggested {{{OP_MAX}}} previously without thinking about this issue, and
 we concluded it was probably useful to have.
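
 To make the difference between the two ways of combining weights concrete,
 here is a small sketch.  The or_weight() and max_weight() helpers are
 hypothetical and work on plain vectors of per-subquery weights rather than
 on Xapian's matcher; they only show the arithmetic, under the assumption
 that each subquery has already been weighted on its own, so no combined
 wdf needs to be formed.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// OP_OR-style combination: the sum of the weights of the subqueries which
// matched the document.
double or_weight(const std::vector<double>& weights) {
    return std::accumulate(weights.begin(), weights.end(), 0.0);
}

// OP_MAX-style combination as proposed here: the greatest weight of any
// matching subquery, or 0 if no subquery matched.
double max_weight(const std::vector<double>& weights) {
    if (weights.empty()) return 0.0;
    return *std::max_element(weights.begin(), weights.end());
}
```

 With subquery weights 1.5, 2.0 and 0.5, or_weight() gives 4.0 while
 max_weight() gives 2.0.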

-- 
Ticket URL: <http://trac.xapian.org/ticket/360#comment:2>
Xapian <http://xapian.org/>