[Xapian-discuss] indexing for phrase searching and constructing queries

Olly Betts olly at survex.com
Thu Jan 25 04:06:52 GMT 2007


On 21/01/07, Richard Jolly <richardjolly at mac.com> wrote:
> 1. phrase searching
>
> I'm having no luck getting phrase searching to work. I expect it's
> because I've not indexed the content correctly. The content is xml. I'm
> basically taking the text content of that, splitting it into words,
> lower casing, stemming and stripping of punctuation. The term position
> passed to add_posting is just incremented, but I'm keeping the same
> position for both the stemmed and the unstemmed words.
>
>   # made up
>   add_posting( 'office', 3 )
>   add_posting( 'offic', 3  )
>
> My hand-wavy understanding of phrase searching is that it's looking for
> consecutive matching terms

Yes, that's correct.

> which is why I've done the stemmed and
> unstemmed words at the same position. But when I do a query, I get no
> results. The debug on the query looks sane to me:
>
>   Xapian::Query((impose:(pos=1) PHRASE 3 time:(pos=2) PHRASE 3 limits:(pos=3)))
>
> How can I tell why this isn't matching? Can I find those three posts in
> the index and compare the positions?

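Use "delve" - it's in the examples subdirectory of xapian-core.  Giving
it both a term (-t) and a document id (-r) lists the positions recorded
for that term in that document, which you can compare against your
add_posting calls.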
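
For intuition about what to look for when comparing positions, here is
a toy model of positional phrase matching in Python.  It is only a
sketch, not Xapian's implementation: the index layout and the
build_index/phrase_match helpers are invented for illustration, and it
only covers the case where the window equals the number of terms (as in
your "PHRASE 3" over three terms).

```python
from collections import defaultdict


def build_index(postings):
    """Build a toy positional index for ONE document.

    postings is an iterable of (term, position) pairs, i.e. the same
    information passed to add_posting.  Hypothetical layout, not Xapian's.
    """
    index = defaultdict(set)
    for term, pos in postings:
        index[term].add(pos)
    return index


def phrase_match(index, terms):
    """Return True if the terms occur at consecutive positions, in order."""
    if not terms:
        return False
    # Try every position of the first term as the start of the phrase and
    # require term i to be present exactly i positions later.
    return any(
        all(start + i in index.get(terms[i], ()) for i in range(1, len(terms)))
        for start in index.get(terms[0], ())
    )


# Stemmed and unstemmed forms share a position, as described above.
shared = build_index([
    ("impose", 1), ("impos", 1),
    ("time", 2),
    ("limits", 3), ("limit", 3),
])

# Hypothetical mistake: the counter advances for every add_posting call,
# so the unstemmed terms are no longer adjacent.
gapped = build_index([
    ("impose", 1), ("impos", 2),
    ("time", 3), ("time", 4),
    ("limits", 5), ("limit", 6),
])
```

With these two indexes, phrase_match(shared, ["impose", "time",
"limits"]) is true and the same call on gapped is false, so gaps or
repeated increments in the indexed positions are the sort of thing to
look for.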

> Secondly, a user entered search with an apostrophe ends up as a phrase
> search - not right at all:
>
>   Xapian::Query(((mike:(pos=1) PHRASE 2 s:(pos=2)) OR tail:(pos=3)))

This is how Omega currently works, and what the QueryParser does.
It's a misfeature really (particularly since it produces a more
expensive phrase search for a case where we don't need to do one).
I'm intending to change this for Xapian 1.0.

> 2. user interfaces
> My next question is about the practicalities of user facing search
> interfaces. I've got a web form with a big text input, and also a
> couple additional controls that correspond to indexed terms. I've then
> got code that combines the term controls with the text input into
> something like:
>
>   ( name:foo AND name:bar ) AND text from text box
>
> And I hand this off to QueryParser. But punctuation seems to mess it
> up. Should I be stripping out punctuation and stop words? Is it a bad
> approach altogether?

It's generally a mistake to try to manipulate user entered text before
passing it to the QueryParser class.  It's better to let the
QueryParser parse the user entered query and then apply additional
filters etc to the Query object produced.

I take it you have a "name" box and a "text" box?  If so, you'd
ideally want to parse each separately using a QueryParser object with
one set to default to "name:", but currently I don't think you can
(I'll take a look in a week or so when I'm back from holiday as this
should be easy to do).

For now, I'd suggest wrapping the "name" field in `name:(' and `)' and
parsing that, then combining it with the result of parsing the "text"
field with OP_AND.
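
A rough sketch of that workaround using the Python bindings follows.
The "XNAME" term prefix and the helper names are assumptions made up
for the example (use whatever prefix the name terms were indexed with),
and it assumes both boxes are non-empty.

```python
def wrap_name_field(name_text):
    """Wrap the raw contents of the "name" box as name:( ... )."""
    return "name:(" + name_text + ")"


def build_query(name_text, body_text, name_prefix="XNAME"):
    """Parse both boxes separately and AND the resulting Query objects.

    name_prefix is a hypothetical term prefix; it must match the prefix
    used for the name terms at index time.
    """
    # Imported here so wrap_name_field() is usable without the bindings.
    import xapian

    parser = xapian.QueryParser()
    # Map the user-visible "name:" field onto the indexed term prefix.
    parser.add_prefix("name", name_prefix)

    name_query = parser.parse_query(wrap_name_field(name_text))
    text_query = parser.parse_query(body_text)

    # Combine the parsed Query objects rather than editing the user's text.
    return xapian.Query(xapian.Query.OP_AND, name_query, text_query)
```

The user's text goes to the QueryParser untouched apart from the
wrapper, and the combination happens on the Query objects.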

Cheers,
    Olly


