[Xapian-discuss] weight question

Fri Nov 2 05:43:45 GMT 2007

On Thu, Nov 01, 2007 at 09:49:39PM -0700, Andrey wrote:
> I have a document that reads:
> "I am eating an apple while using apple computer"
> 
> My xapian query:
> apple(weight:4)
> computer(weight:3)
> 
> Instead of getting a weight of 11 for this doc (2Xapple 1Xcomputer), how can 
> I make the matching boolean, so that I get a weight of 7 for this 
> document?

If I understand correctly, you want to ignore the wdf of terms - you can
do that by setting BM25's k1 parameter to 0:

http://www.xapian.org/docs/apidoc/html/classXapian_1_1BM25Weight.html#_details

That's not what I'd call "boolean" weighting though, so perhaps I'm
misunderstanding you...
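
To illustrate why k1 = 0 removes the effect of wdf, here is a minimal
sketch of the textbook BM25 wdf factor.  It is not Xapian's actual code:
the function names, the b and normlen defaults, and the idea of
multiplying the factor directly by the per-term weights from the
question are all assumptions made for illustration.

```python
def bm25_wdf_factor(wdf, k1, b=0.5, normlen=1.0):
    """Textbook BM25 wdf saturation factor (illustrative, not Xapian's code)."""
    if wdf == 0:
        return 0.0  # term absent from the document: no contribution
    return (k1 + 1) * wdf / (k1 * ((1 - b) + b * normlen) + wdf)


def score(doc_wdfs, query_weights, k1):
    """Sum of (assumed per-term weight) x (wdf factor) over the query terms."""
    return sum(
        weight * bm25_wdf_factor(doc_wdfs.get(term, 0), k1)
        for term, weight in query_weights.items()
    )


# "I am eating an apple while using apple computer"
doc_wdfs = {"apple": 2, "computer": 1}
query_weights = {"apple": 4, "computer": 3}

# With k1 = 0 the factor is wdf / wdf = 1 for any term that is present,
# so the second "apple" adds nothing.
print(score(doc_wdfs, query_weights, k1=0))  # 7.0
```

With any k1 > 0 the factor grows with wdf (and then saturates), which is
why repeated terms normally raise the weight.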

> Is it possible to add a "penalty" to a query?
> docA = "How to eat an apple while using apple computer"
> docB = "I am eating an apple while using apple computer"
> 
> Query(apple:4,computer:3,how:-1) << is it possible to penalise (reduce the 
> weight of) a doc which has the term "how", so that docB ranks higher?

I don't think that's currently possible without indexing each document
which doesn't contain "how" with a "XNOThow" term, or something similar.

Several of the matcher's optimisations rely on the current fact that
terms can't contribute a negative amount, so I think the only way to do
this would be to add something to all documents which don't contain
"how".  It would probably be possible to implement a query operator
which did that.
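
A rough sketch of the "XNOThow" idea, in plain Python rather than the
Xapian API.  The helper names, the whitespace tokenisation and the
simple sum-of-weights scoring are assumptions for illustration only; the
point is that the penalty becomes a positive bonus on every document
which lacks the term.

```python
def index_terms(text, penalised=("how",)):
    """Return the set of terms to index for a document.

    For each penalised word the document does NOT contain, add a marker
    term "XNOT<word>" (hypothetical prefix, as suggested above).
    """
    terms = set(text.lower().split())
    for word in penalised:
        if word not in terms:
            terms.add("XNOT" + word)
    return terms


def score(terms, query_weights):
    """Boolean-style score: add a term's weight once if the term is present."""
    return sum(w for term, w in query_weights.items() if term in terms)


doc_a = index_terms("How to eat an apple while using apple computer")
doc_b = index_terms("I am eating an apple while using apple computer")

# how:-1 is rewritten as XNOThow:+1, so no term contributes a negative amount.
query = {"apple": 4, "computer": 3, "XNOThow": 1}

print(score(doc_a, query), score(doc_b, query))  # 7 8
```

docB now outranks docA by the same margin that a penalty of 1 on "how"
would have produced.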

You can completely exclude documents which contain a particular term
though, using OP_AND_NOT.
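
The effect of OP_AND_NOT can be pictured with a small plain-Python
sketch (this models the semantics only, it is not the Xapian API; the
function name and the document ids are made up): documents matched by
the right-hand side are dropped, and the weights of the remaining
documents come from the left-hand side alone.

```python
def and_not(left_matches, right_matches):
    """left_matches: dict of doc id -> weight; right_matches: ids to exclude."""
    return {
        doc_id: weight
        for doc_id, weight in left_matches.items()
        if doc_id not in right_matches
    }


# Hypothetical results: both docs match "apple computer" with weight 7,
# and docA also contains "how".
left = {"docA": 7, "docB": 7}
right = {"docA"}

print(and_not(left, right))  # {'docB': 7}
```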

> How heavy will it be if I add a value of "hash(md5  HTML<title> X 
> websiteDomain)" to each document, and then use this key to collapse 
> duplicated-title-in-domain using set_collapse_key? Is it way too heavy?

How much overhead it incurs will depend on the nature of your data (for
example if the sites you are indexing each have millions of pages with
each title, the cost will probably be higher as you'll be rejecting a
large number of matches).

It's not an obviously ridiculous idea in general, so all I can really
suggest is that you try it on your data and see if it performs
acceptably.
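
For a feel of what collapsing does, here is a minimal sketch in plain
Python, not the Xapian API.  The way the key combines domain and title,
the helper names and the sample hits are all assumptions; it keeps the
best-ranked hit for each key and counts how many matches were rejected,
which is the quantity the overhead depends on.

```python
import hashlib


def collapse_key(title, domain):
    """MD5 over domain + title (the exact combination is an assumption)."""
    return hashlib.md5((domain + "\0" + title).encode("utf-8")).hexdigest()


def collapse(ranked_hits):
    """ranked_hits: list of (weight, title, domain), best first.

    Returns (kept hits, number of rejected duplicates).
    """
    seen = set()
    kept = []
    rejected = 0
    for weight, title, domain in ranked_hits:
        key = collapse_key(title, domain)
        if key in seen:
            rejected += 1  # same title on the same domain: drop it
            continue
        seen.add(key)
        kept.append((weight, title, domain))
    return kept, rejected


hits = [
    (9.0, "Apple news", "example.com"),
    (8.5, "Apple news", "example.com"),  # duplicate title in the same domain
    (8.0, "Apple news", "example.org"),  # same title, different domain: kept
    (7.0, "Pear news", "example.com"),
]
kept, rejected = collapse(hits)
print(len(kept), rejected)  # 3 1
```

Running something like this over a sample of your own titles and domains
gives a quick estimate of what fraction of matches would be rejected.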

Cheers,
    Olly