[Xapian-discuss] indexing for phrase searching and constructing queries

Richard Jolly richardjolly at mac.com
Fri Jan 26 08:06:14 GMT 2007


On 25 Jan 2007, at 04:06, Olly Betts wrote:

> On 21/01/07, Richard Jolly <richardjolly at mac.com> wrote:
>> 1. phrase searching
>>
>> I'm having no luck getting phrase searching to work. I expect it's
>> because I've not indexed the content correctly. The content is XML.
>> I'm basically taking the text content of that, splitting it into
>> words, lower-casing, stemming and stripping punctuation. The term
>> position passed to add_posting is just incremented, but I'm keeping
>> the same position for both the stemmed and the unstemmed words.
>>
>>   # made up
>>   add_posting( 'office', 3 )
>>   add_posting( 'offic', 3  )
>>
>> My hand-wavy understanding of phrase searching is that it's looking
>> for consecutive matching terms
>
> Yes, that's correct.
>
>> which is why I've done the stemmed and
>> unstemmed words at the same position. But when I do a query, I get no
>> results. The debug output for the query looks sane to me:
>>
>>   Xapian::Query((impose:(pos=1) PHRASE 3 time:(pos=2) PHRASE 3 limits:(pos=3)))
>>
>> How can I tell why this isn't matching? Can I find those three
>> postings in the index and compare the positions?
>
> Use "delve" - it's in the examples subdirectory of xapian-core.

Ok, delve makes a lot of sense. Thanks.

I've changed my indexes. Previously I was doing two things I thought
would improve matching. First, I removed short words from the source
text. Then I lower-cased and stemmed each word in it, but added all
the combinations as postings. Basically I was trying to add as many
possible matches as I could.

Now I lower-case all text, split on whitespace to form words, then
remove punctuation. I might put stemming back in, but I haven't yet. Is
there a best practice for this, or are there common strategies?

I'm particularly curious about punctuation.

I guess the general lesson is that whatever you do to the source text
to index it should also be done to the query entered.
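
To make that lesson concrete, here is a minimal sketch in plain Python. It
is a toy positional index, not Xapian's implementation: the function names
(normalize, build_index, phrase_match) and the sample documents are made up
for illustration, and the phrase check requires strictly consecutive
positions, which is stricter than a PHRASE window. The point is only that
the same normalize() runs on both the source text and the query.

```python
import string
from collections import defaultdict


def normalize(text):
    """Lower-case, split on whitespace, strip punctuation from word edges."""
    words = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)
        if word:
            words.append(word)
    return words


def build_index(docs):
    """Map term -> {doc_id: set of positions}; positions start at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(normalize(text), start=1):
            index[term].setdefault(doc_id, set()).add(pos)
    return index


def phrase_match(index, query):
    """Return ids of documents holding the query terms at consecutive positions."""
    terms = normalize(query)  # same treatment as the indexed text
    if not terms:
        return set()
    hits = set()
    for doc_id, positions in index.get(terms[0], {}).items():
        for start in positions:
            if all(start + i in index.get(term, {}).get(doc_id, ())
                   for i, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


# Hypothetical sample documents.
docs = {
    1: "The court may impose time limits.",
    2: "Time limits? Impose them!",
}
index = build_index(docs)
```

Because "limits." and "limits?" both normalize to "limits", a query such as
"Impose time limits" finds document 1, where the three terms sit at
positions 4, 5 and 6. If the query skipped normalize(), the capital "I"
alone would be enough to miss.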

>> Secondly, a user entered search with an apostrophe ends up as a phrase
>> search - not right at all:
>>
>>   Xapian::Query(((mike:(pos=1) PHRASE 2 s:(pos=2)) OR tail:(pos=3)))
>
> This is how Omega currently works, and what the QueryParser does.
> It's a misfeature really (particularly since it produces a more
> expensive phrase search for a case where we don't need to do one).
> I'm intending to change this for Xapian 1.0.
>
>> 2. user interfaces
>> My next question is about the practicalities of user facing search
>> interfaces. I've got a web form with a big text input, and also a
>> couple additional controls that correspond to indexed terms. I've then
>> got code that combines the term controls with the text input into
>> something like:
>>
>>   ( name:foo AND name:bar ) AND text from text box
>>
>> And I hand this off to QueryParser. But punctuation seems to mess it
>> up. Should I be stripping out punctuation and stop words? Is it a bad
>> approach altogether?
>
> It's generally a mistake to try to manipulate user entered text before
> passing it to the QueryParser class.  It's better to let the
> QueryParser parse the user entered query and then apply additional
> filters etc to the Query object produced.

I've stopped mucking with the text - and the code does look a lot 
cleaner.

> I take it you have a "name" box and a "text" box?  If so, you'd
> ideally want to parse each separately using a QueryParser object with
> one set to default to "name:", but currently I don't think you can
> (I'll take a look in a week or so when I'm back from holiday as this
> should be easy to do).
>
> For now, I'd suggest wrapping the "name" field in `name:(' and `)' and
> parsing that, then combining it with the result of parsing the "text"
> field with OP_AND.

I get the idea, but I haven't quite got it working.
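
For reference, the suggestion might look roughly like this with the Xapian
Python bindings. This is a sketch under assumptions, not tested against my
setup: the "XNAME" term prefix and the helper names wrap_name_field and
build_query are made up, and it assumes the name terms were indexed with
that prefix.

```python
def wrap_name_field(name_text):
    """Wrap the "name" box contents as name:( ... ); return "" if it is empty."""
    name_text = name_text.strip()
    return "name:(%s)" % name_text if name_text else ""


def build_query(name_text, body_text, name_prefix="XNAME"):
    """Parse each field separately, then AND the resulting Query objects.

    Needs the Xapian Python bindings. "XNAME" is a hypothetical term prefix.
    """
    import xapian  # imported here so wrap_name_field works without the bindings

    parser = xapian.QueryParser()
    parser.add_prefix("name", name_prefix)  # lets the parser understand name:

    parts = []
    wrapped = wrap_name_field(name_text)
    if wrapped:
        parts.append(parser.parse_query(wrapped))
    if body_text.strip():
        parts.append(parser.parse_query(body_text))

    if not parts:
        return xapian.Query()  # empty query, matches nothing

    query = parts[0]
    for part in parts[1:]:
        # Combine the parsed Query objects instead of editing the user's text.
        query = xapian.Query(xapian.Query.OP_AND, query, part)
    return query
```

Calling build_query("foo bar", "text from text box") would parse
"name:(foo bar)" and the text box contents separately and join them with
OP_AND. An empty name box is skipped, so the parser never sees "name:()".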


Thanks for your help,

Richard



