[Xapian-discuss] what does get_eset do?; searching by relevance and value; what is the scope of an enquire object?;

Olly Betts olly at survex.com
Tue Dec 5 00:57:29 GMT 2006


There's no need to cc: me on list mail BTW.

On Mon, Dec 04, 2006 at 09:21:02AM -0500, Jason W. Solinsky wrote:
> 1. What is Enquire.get_eset supposed to do?
> 
> Does it return terms which are:
> 
> b. related to documents that are in the input RSet?

This one.  The Query influences the result in only one way: by default,
terms from the query are excluded from the ESet (but you can change
that).
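
To make the behaviour described above concrete, here is a toy model in
plain Python.  It is only a sketch, not Xapian's implementation: the
function toy_eset, its parameters and the ranking rule (count how many
RSet documents contain each term) are assumptions made up for
illustration, and Xapian's real expansion weighting is more involved.
The include_query_terms argument stands in for the option of keeping
query terms in the ESet, which as far as I know is the
Enquire::INCLUDE_QUERY_TERMS flag in the C++ API.

```python
from collections import Counter


def toy_eset(rset_docs, query_terms, max_items=3, include_query_terms=False):
    """Toy expansion set: rank terms by how many RSet documents contain them.

    Hypothetical helper for illustration only; Xapian uses a proper
    term-weighting formula rather than a raw document count.
    """
    counts = Counter()
    for doc_terms in rset_docs:
        counts.update(set(doc_terms))  # count each term once per document

    if not include_query_terms:
        # Default behaviour described in the post: drop the query's own terms.
        for term in query_terms:
            counts.pop(term, None)

    # Highest count first; ties broken alphabetically so the result is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:max_items]]


# Three documents the user marked as relevant (the RSet), as lists of terms.
rset_docs = [
    ["xapian", "index", "search"],
    ["xapian", "search", "query"],
    ["search", "relevance", "query"],
]
query_terms = ["search"]

default_eset = toy_eset(rset_docs, query_terms)
full_eset = toy_eset(rset_docs, query_terms, include_query_terms=True)
```

With the default setting the query term "search" never appears in the
result, even though it occurs in every RSet document; with
include_query_terms=True it comes out on top.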

> 2. What is the preferred way of combining a search by relevance and a search
> by value? Suppose I am running a query that returns thousands of results and
> I want to identify the best 100 documents by some combination of document
> value and relevance to the query. Is the preferred way to first find all the
> desired results, and then filter through them by value, or is there some way
> of improving efficiency by combining both into a single step?

I'm not sure there's a single simple answer which fits all situations.

You can sort by "value then relevance", which will rank by relevance
within bands of equal value.

If you use a suitable weighting scheme (e.g. TradWeight, BM25 with
certain parameters, or a custom weighting scheme) so that different
documents more commonly get exactly the same weight, you can sort by
"relevance then value".
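
The two orderings can be illustrated with an ordinary sort in plain
Python.  This is a sketch under assumptions, not Xapian code: the Hit
tuple, the sample data and the "higher value is better" convention are
made up for the example, and values are integers here for simplicity.
In Xapian itself I believe the corresponding Enquire settings are
set_sort_by_value_then_relevance and set_sort_by_relevance_then_value.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical match result: document id, stored value, relevance weight."""
    docid: int
    value: int
    weight: float


hits = [
    Hit(1, 5, 0.9),
    Hit(2, 7, 0.2),
    Hit(3, 5, 1.4),
    Hit(4, 7, 0.2),
    Hit(5, 3, 1.4),
]


def value_then_relevance(hits):
    # Primary key: value (descending).  Relevance only orders documents
    # inside a band of equal value.  docid is a final tie-breaker.
    return sorted(hits, key=lambda h: (-h.value, -h.weight, h.docid))


def relevance_then_value(hits):
    # Primary key: weight (descending).  The value only matters when two
    # documents have exactly the same weight, which is why a weighting
    # scheme that produces many ties is needed for this to be useful.
    return sorted(hits, key=lambda h: (-h.weight, -h.value, h.docid))


by_value = [h.docid for h in value_then_relevance(hits)]
by_relevance = [h.docid for h in relevance_then_value(hits)]
```

In the sample data, documents 3 and 5 share a weight of 1.4, so under
"relevance then value" the value decides between them; under "value then
relevance" they end up in different bands and never compete.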

> 3. How does Query::OP_ELITE_SET work?

It was conceived as a way to allow you to take a large piece of text and
turn it into a query by just extracting all the words and combining them
with OP_ELITE_SET.  It was used by webtop.com to implement their
webcheck desktop utility - you could drag any document or text selection
to this and it would search for related pages on the web.  I believe
there was also some client side processing of the text before sending
the terms to the search engine.

So OP_ELITE_SET picks the "best" N terms to search for from a larger
set in some undefined manner.  It's really specified by its intended
effect not its current implementation, so the description which follows
could change in the future.

Currently OP_ELITE_SET looks at what the weighting scheme reports as the
maximum possible weight that the term could return, and picks the N
terms with the highest maximum possible weights.  Exactly what this
means depends on the weighting scheme, but typically it will tend to
avoid common terms.
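
The selection step described above can be sketched in plain Python.
Everything here is an assumption for illustration: the functions
max_possible_weight and elite_set are hypothetical, and the IDF-style
upper bound merely stands in for whatever the real weighting scheme
reports as a term's maximum possible weight.  It does show why common
terms tend to be dropped.

```python
import heapq
import math


def max_possible_weight(doc_freq, collection_size):
    """Assumed IDF-style upper bound: rarer terms can contribute more weight."""
    return math.log((collection_size - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)


def elite_set(term_doc_freqs, collection_size, n):
    """Pick the n terms with the highest maximum possible weight."""
    return heapq.nlargest(
        n,
        term_doc_freqs,  # iterates over the terms (dict keys)
        key=lambda term: max_possible_weight(term_doc_freqs[term], collection_size),
    )


# Hypothetical document frequencies for words extracted from a piece of text,
# in a collection of 10000 documents.
term_doc_freqs = {
    "the": 9500,
    "search": 1200,
    "of": 9000,
    "xapian": 15,
    "probabilistic": 40,
    "engine": 800,
}

elite = elite_set(term_doc_freqs, collection_size=10000, n=3)
```

Here the three rarest terms are kept, and very common words such as
"the" and "of" are left out of the query.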

> 4. What is the intended scope of an Enquire object?

It mostly exists as a container for all the different settings for a
match or query expansion operation.  Otherwise get_mset would take
far too many parameters!

> Is there any particular performance advantage or robustness penalty to using
> a single enquire object across multiple calls to get_eset and get_mset?

It doesn't make much difference really.  If you're running queries with
the same sort and other settings, it's probably simplest to reuse the
same Enquire object, but it's pretty cheap to tear down and rebuild, so
if that fits your architecture better, there's no problem doing that.

> 5. Finally, is there a particular place in the documentation that I should
> be looking to find answers to questions such as these?

I don't think any of the above are answered in the documentation (not in
much detail anyway).  I'll try to slot the answers above into the docs
as time allows.

But generally the doxygen-generated API documentation and the HTML
overview are probably the most helpful.  The wiki has a slowly growing
collection of useful information too.

Cheers,
    Olly


