[Xapian-discuss] Phrase Query vs AND Query? Why don't these find the same things?

jarrod roberson jarrod at vertigrated.com
Thu Jun 1 17:01:28 BST 2006


On 6/1/06, jarrod roberson <jarrod at vertigrated.com> wrote:
>
>
>
> On 5/31/06, Olly Betts <olly at survex.com> wrote:
>
> > On Tue, May 30, 2006 at 09:06:48PM -0400, jarrod roberson wrote:
>
> I'm not sure I understand what you're trying to search for here.  But
> generally a PHRASE query is better when the terms are only useful in
> a particular order, or when there are common false positives which
> include all the terms but aren't relevant.
>
> A phrase query is slower than an AND, so that's a reason to favour AND
> if there's no other reason to pick one or the other.
>
> Cheers,
>     Olly
>
>
> Thanks for the reply and the example code on how to use the inmemory
> database.
>
> I think I figured out where SOME of the instability in the results is
> coming from.
> I am calling enquire.get_mset(0,0).get_matches_estimated() to find out how
> many items I should really query for.
> Using enquire.get_mset(0,1).get_matches_estimated() seems to fix what used
> to work with 0,0 (I just upgraded to 0.9.5 a couple of weeks ago).
> It seems that some queries work better with this than others, so . . .
>
> I assume this is not the best way to do this, so I guess I need to know
> what the best idiom is for getting back ALL matches for a query.
>
> I know it is not common, but I will always be wanting all results back for
> this particular project.
>


Thanks again, Olly. Your suggestions helped me figure out exactly what was
going wrong.
I know I am answering my own question, but I wanted to make sure this is the
CORRECT solution. Just because it works doesn't mean it is the CORRECT
solution.

I changed it to this and it returns EXACTLY the number of documents I was
expecting!

#!/usr/bin/env python

import xapian

db = xapian.Database('/index/wfs/')

# One Query per term, with wqf 1 and the term's 1-based position in the phrase.
terms = ["LP:backup", "LP:c:", "LP:program files", "LP:Adobe"]
termQueries = [xapian.Query(term, 1, pos + 1) for pos, term in enumerate(terms)]
phraseQuery = xapian.Query(xapian.Query.OP_PHRASE, termQueries)

# Restrict the phrase matches to documents that also carry the MBOX term.
userQuery = xapian.Query('MBOX:12345678-1234-1234-1234-1234567890ab')
query = xapian.Query(xapian.Query.OP_AND, userQuery, phraseQuery)

enq = xapian.Enquire(db)
enq.set_query(query)
# Ask for as many matches as there are documents, i.e. all of them.
mset = enq.get_mset(0, db.get_doccount())

print(query.get_description())
print("%d matches" % mset.get_matches_estimated())

and the result

Xapian::Query((MBOX:12345678-1234-1234-1234-1234567890ab AND
(LP:backup:(pos=1) PHRASE 4 LP:c::(pos=2) PHRASE 4 LP:program files:(pos=3)
PHRASE 4 LP:Adobe:(pos=4))))
345 matches

I assume that using the posting position should make it more efficient and
more exact, right?
Since I only want matches where those terms are in that EXACT positional
order.
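
To make the difference in the subject line concrete, here is a toy model in
plain Python of why a PHRASE query can match fewer documents than an AND of
the same terms. It is only a sketch and not how Xapian implements matching:
the functions and_match and phrase_match and the two sample documents are
made up for illustration, and the window rule (terms in query order, all
inside a span of `window` positions) is a simplified reading of what
OP_PHRASE does.

```python
# Toy positional index: each document maps a term to its sorted positions.
# The names and sample data here are hypothetical, not part of Xapian.

def and_match(doc, terms):
    """AND: every term occurs somewhere in the document."""
    return all(term in doc for term in terms)

def phrase_match(doc, terms, window=None):
    """PHRASE (simplified): terms occur in query order within `window` positions."""
    if window is None:
        window = len(terms)  # assumed default: window equals the number of terms
    if not and_match(doc, terms):
        return False

    def extend(i, first, prev):
        # Try to place terms[i] after `prev` without leaving the window.
        if i == len(terms):
            return True
        for p in doc[terms[i]]:
            if p > prev and p - first < window and extend(i + 1, first, p):
                return True
        return False

    return any(extend(1, p, p) for p in doc[terms[0]])

terms = ["LP:backup", "LP:c:", "LP:program files", "LP:Adobe"]

# Same four terms, once in query order and once with LP:Adobe moved to the front.
in_order = {"LP:backup": [1], "LP:c:": [2], "LP:program files": [3], "LP:Adobe": [4]}
shuffled = {"LP:Adobe": [1], "LP:backup": [2], "LP:c:": [3], "LP:program files": [4]}

for name, doc in (("in_order", in_order), ("shuffled", shuffled)):
    print("%s: AND=%s PHRASE=%s" % (name, and_match(doc, terms), phrase_match(doc, terms)))
```

In this model both documents satisfy the AND, but only the one with the
terms in query order satisfies the PHRASE, so an AND result set can be a
superset of the PHRASE result set.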

