[Xapian-discuss] Two questions

Sam Liddicott sam at liddicott.com
Mon May 2 22:17:08 BST 2005


roki roki wrote:

>Thanks Sam,
>
>I am using Xapian through the Perl module, so the MINHITS option is not
>available there, and the checkatleast parameter is not implemented in the
>Perl module.
>
I don't use the Perl interface, but I would be surprised if it is not
available as a matcher option.

>Can I change this directly in enquire.h and recompile it? I need to get
>the first 500 grouped results; what would be the correct parameter?
>
Well, this sounds like a lot. I suppose you mean 500 results after
grouping? How many do you expect in a group?
That could involve examining more than 10,000 hits. One reason the search
could be taking so long is that, with collapsing, xapian is struggling to
find enough documents to fill your result set. If you don't have 500
different values to collapse on, it would be producing a result set from
your entire DB each time, so no wonder it takes a long time.
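
To make the cost concrete, here is a minimal sketch in Python of collapsing
done in relevance order. It is a toy model and not Xapian's matcher: the
function name collapse_top and the (docid, weight, key) tuples are made up
for illustration.

```python
def collapse_top(hits, wanted):
    """Keep the best hit per collapse key; stop once `wanted` keys are found.

    `hits` is a list of (docid, weight, key) tuples (hypothetical layout).
    Returns (results, number_of_hits_examined).
    """
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    seen = set()
    results = []
    examined = 0
    for docid, weight, key in ranked:
        examined += 1
        if key in seen:
            continue  # a better hit with this key was already kept
        seen.add(key)
        results.append((docid, weight, key))
        if len(results) == wanted:
            break
    return results, examined


# 12 matching documents, but only 3 distinct collapse values.
hits = [(i, 100 - i, "site%d" % (i % 3)) for i in range(12)]

few, examined_few = collapse_top(hits, 2)
many, examined_many = collapse_top(hits, 5)
```

Asking for 2 results stops after examining 2 hits. Asking for 5 when only 3
collapse values exist examines all 12 hits and still returns only 3.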

It is likely that any further documents that match will also be
collapsed, and so will not increase the size of your result set. I don't
think the optimiser takes this into account, and I'm not sure it easily
could, as the possible collapse values are not necessarily tied to
document terms.

Something that may help you is to use a relevance cutoff, or to increase
the one you already have. If you are collapsing, you are not going to see
the least relevant hits anyway, so a cutoff only becomes a problem when it
cuts off entire collapse categories whose best hit was not relevant enough.
You will have to examine the least relevant results to see if they are
actually useful, and to see if you can increase the relevance cutoff point.
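
A rough way to check which collapse categories a cutoff would cost you is
sketched below. The helper names and sample weights are hypothetical, and
the percentage is taken relative to the best hit's weight, which is a
simplification assumed for this sketch rather than a statement of how
Xapian computes percentages.

```python
def apply_cutoff(hits, percent):
    """Drop hits scoring below `percent` of the best weight (simplified)."""
    if not hits:
        return []
    best = max(weight for _, weight, _ in hits)
    threshold = best * percent / 100.0
    return [h for h in hits if h[1] >= threshold]


def best_per_key(hits):
    """Return {collapse key: best weight} over the given hits."""
    best = {}
    for _, weight, key in hits:
        if key not in best or weight > best[key]:
            best[key] = weight
    return best


# (docid, weight, collapse key); all values are made up.
hits = [
    (1, 90, "a"), (2, 40, "a"), (3, 10, "a"),
    (4, 70, "b"), (5, 20, "b"),
    (6, 25, "c"),
]

before = best_per_key(hits)
after = best_per_key(apply_cutoff(hits, 50))
lost_keys = sorted(set(before) - set(after))
```

Here the weaker hits in categories "a" and "b" disappear without changing
the collapsed result, while category "c" is lost entirely because its best
hit falls under the cutoff. That lost list is the thing to inspect before
raising the cutoff further.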

Certainly, if you are collapsing you should not request a result set
that contains more entries than there are values to be collapsed upon. I
suppose this is something the optimiser could take into account: once it
has collapsed on every possible value (and if it is selecting documents in
order of relevance, as it generally does in a rough sort of way (I should
apologise to Olly for speaking so vaguely about this)), then it could
stop matching at that point.
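
The early stop suggested above might look something like this. It is only
an illustration of the idea, not something Xapian is known to do, and it
assumes the number of distinct collapse values is known up front, which is
exactly the part that may not be easy in practice. All names are
hypothetical.

```python
def collapse_with_early_stop(ranked_hits, wanted, distinct_keys):
    """Collapse in relevance order, stopping when no new key can appear.

    `ranked_hits` must already be sorted best first, as (docid, key) pairs.
    `distinct_keys` is the number of collapse values in the database,
    assumed to be known in advance (hypothetical).
    """
    target = min(wanted, distinct_keys)  # cannot return more than this
    seen = set()
    results = []
    examined = 0
    for docid, key in ranked_hits:
        if len(results) >= target:
            break  # every reachable slot is filled; stop matching
        examined += 1
        if key not in seen:
            seen.add(key)
            results.append(docid)
    return results, examined


ranked = [(1, "a"), (2, "b"), (3, "a"), (4, "c"), (5, "b"), (6, "a"), (7, "c")]
results, examined = collapse_with_early_stop(ranked, wanted=500, distinct_keys=3)
```

With 500 results requested but only 3 collapse values, the loop stops after
the fourth hit instead of walking the whole list.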

So..
1) Make sure you don't request more documents in the result set than
there are collapse values.
2) Make MINHITS = result set size (should be a matcher option).
3) Consider using a relevance cutoff.

Wiser folk may have other ideas.

Sam

>Thanks
>Roki
>
>>>Hi there,
>>>I have implemented Xapian with success on two million HTML documents
>>>and it works very, very well! Thanks for this nice software!
>>>
>>>Results are almost always returned in less than 1 second, but when I use
>>>set_collapse_key (I really need this) searches can take up to 10
>>>seconds.
>>>
>>>Is there any way to speed this up?
>>>
>>Because you are collapsing multiple matching documents into 1 hit,
>>xapian has to find more documents to get the same number of hits.
>>
>>If you are using omega you could try reducing MIN_HITS (may be called
>>MINHITS now) which reduces how many results xapian tries to find, but of
>>course also reduces the accuracy of how many hits xapian thinks there
>>might be.
>>
>>>Is it possible to get a list of the most used words in the database?
>>>
>>I'll let someone else answer this.
>>
>>Sam
>>
>>>Thanks!
>>>Roki
