[Xapian-discuss] indexing for phrase searching and constructing queries
Richard Jolly
richardjolly at mac.com
Fri Jan 26 08:06:14 GMT 2007
On 25 Jan 2007, at 04:06, Olly Betts wrote:
> On 21/01/07, Richard Jolly <richardjolly at mac.com> wrote:
>> 1. phrase searching
>>
>> I'm having no luck getting phrase searching to work. I expect it's
>> because I've not indexed the content correctly. The content is xml. I'm
>> basically taking the text content of that, splitting it into words,
>> lower casing, stemming and stripping of punctuation. The term position
>> passed to add_posting is just incremented, but I'm keeping the same
>> position for both the stemmed and the unstemmed words.
>>
>> # made up
>> add_posting( 'office', 3 )
>> add_posting( 'offic', 3 )
>>
>> My hand-wavy understanding of phrase searching is that it's looking for
>> consecutive matching terms
>
> Yes, that's correct.
>
>> which is why I've done the stemmed and
>> unstemmed words at the same position. But when I do a query, I get no
>> results. The debug on the query look sane to me:
>>
>> Xapian::Query((impose:(pos=1) PHRASE 3 time:(pos=2) PHRASE 3 limits:(pos=3)))
>>
>> How can I tell why this isn't matching? Can I find those three posts in
>> the index and compare the positions?
>
> Use "delve" - it's in the examples subdirectory of xapian-core.
Ok, delve makes a lot of sense. Thanks.
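
For anyone following along, here is a toy model of what a positional index stores and how a phrase check compares positions. It is a plain-Python sketch under stated assumptions, not Xapian's implementation: `build_postings` and `phrase_match` are made-up names, and the check only accepts strictly consecutive positions rather than a PHRASE window.

```python
from collections import defaultdict

def build_postings(terms_with_pos):
    """Map each term to the set of positions it occurs at (one document)."""
    postings = defaultdict(set)
    for term, pos in terms_with_pos:
        postings[term].add(pos)
    return postings

def phrase_match(postings, phrase_terms):
    """Return True if the terms occur at consecutive positions, in order."""
    if not phrase_terms:
        return False
    # Try every position of the first term as a possible phrase start.
    for start in postings.get(phrase_terms[0], ()):
        if all(start + i in postings.get(term, ())
               for i, term in enumerate(phrase_terms)):
            return True
    return False

# Stemmed and unstemmed forms share a position, as in the made-up example.
doc = [("impose", 1), ("impos", 1), ("time", 2), ("limits", 3), ("limit", 3)]
postings = build_postings(doc)
print(phrase_match(postings, ["impose", "time", "limits"]))
print(phrase_match(postings, ["time", "impose"]))
```

If a phrase that should match does not, dumping the positions recorded for each term (which is what delve is for) shows whether they are consecutive.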
I've changed my indexes. Previously I was doing two things I thought
would improve matching. First, I removed short words from the source
text. Then I lower-cased and stemmed each word in it, but added all
the combinations as postings. Basically I was trying to add as many
possible matches as I could.
Now I lower-case all text, split on whitespace to form words, then
remove punctuation. I might put stemming back in, but I haven't yet. Is
there a best practice for this, or are there common strategies?
I'm particularly curious about punctuation.
I guess the general lesson is that whatever you do to the source text
to index it should also be done to the query entered.
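
A minimal sketch of that lesson, assuming the pipeline described above (lower-case, split on whitespace, remove punctuation). The function names are hypothetical, and stripping punctuation only from the ends of each word is an assumption; the point is just that indexing and querying share one normalisation routine.

```python
import string

def normalise(text):
    """Lower-case, split on whitespace, strip leading/trailing punctuation."""
    terms = []
    for word in text.lower().split():
        word = word.strip(string.punctuation)
        if word:  # drop tokens that were punctuation only
            terms.append(word)
    return terms

def index_terms(text):
    """Return (term, position) pairs for indexing, positions starting at 1."""
    return [(term, pos) for pos, term in enumerate(normalise(text), start=1)]

def query_terms(text):
    """Normalise a query exactly the same way as the indexed text."""
    return normalise(text)

print(index_terms("Impose time limits!"))
print(query_terms("IMPOSE Time Limits"))
```

Because both sides go through `normalise`, a change such as putting stemming back in only has to be made in one place.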
>> Secondly, a user entered search with an apostrophe ends up as a phrase
>> search - not right at all:
>>
>> Xapian::Query(((mike:(pos=1) PHRASE 2 s:(pos=2)) OR tail:(pos=3)))
>
> This is how Omega currently works, and what the QueryParser does.
> It's a misfeature really (particularly since it produces a more
> expensive phrase search for a case where we don't need to do one).
> I'm intending to change this for Xapian 1.0.
>
>> 2. user interfaces
>> My next question is about the practicalities of user facing search
>> interfaces. I've got a web form with a big text input, and also a
>> couple additional controls that correspond to indexed terms. I've then
>> got code that combines the term controls with the text input into
>> something like:
>>
>> ( name:foo AND name:bar ) AND text from text box
>>
>> And I hand this off to QueryParser. But punctuation seems to mess it
>> up. Should I be stripping out punctuation and stop words? Is it a bad
>> approach all together?
>
> It's generally a mistake to try to manipulate user entered text before
> passing it to the QueryParser class. It's better to let the
> QueryParser parse the user entered query and then apply additional
> filters etc to the Query object produced.
I've stopped mucking with the text - and the code does look a lot
cleaner.
> I take it you have a "name" box and a "text" box? If so, you'd
> ideally want to parse each separately using a QueryParser object with
> one set to default to "name:", but currently I don't think you can
> (I'll take a look in a week or so when I'm back from holiday as this
> should be easy to do).
>
> For now, I'd suggest wrapping the "name" field in `name:(' and `)' and
> parsing that, then combining it with the result of parsing the "text"
> field with OP_AND.
I get the idea, but I haven't quite got it working.
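
For reference, a rough sketch of that suggestion using the Python bindings. It has not been run against a real database, and several things are assumptions: the term prefix `XNAME` mapped to `name`, the helper names, and that the parser accepts the bracketed `name:(...)` form as described above. The import is guarded only so that the string helper can be tried without Xapian installed.

```python
try:
    import xapian  # Python bindings for Xapian; only needed by build_query
except ImportError:
    xapian = None

def wrap_name_field(name_text):
    """Wrap the contents of the "name" box as name:( ... )."""
    return "name:(%s)" % name_text.strip()

def build_query(db, name_text, body_text, name_prefix="XNAME"):
    """Parse each box separately, then AND the two Query objects together.

    name_prefix is a hypothetical term prefix; use whatever the indexer used.
    """
    if xapian is None:
        raise RuntimeError("xapian bindings are not installed")
    qp = xapian.QueryParser()
    qp.set_database(db)
    qp.add_prefix("name", name_prefix)
    name_query = qp.parse_query(wrap_name_field(name_text))
    text_query = qp.parse_query(body_text)
    # Combine the parsed Query objects rather than gluing strings together.
    return xapian.Query(xapian.Query.OP_AND, name_query, text_query)

print(wrap_name_field(" foo bar "))
```

The user's text is passed to the parser untouched; only the resulting Query objects are combined.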
Thanks for your help,
Richard