[Xapian-devel] Explanation of how Eset works
aarshkshah1992 at gmail.com
Thu Jan 10 19:12:32 GMT 2013
He olly,thanks a lot for your detailed reply.So basically an ESET is formed
by ranking terms based on the combined weights((by using something similar
to BM25) assigned to the documents in the Rset (formed by the top 5 entries
in the MSET or selected by us ) which are present in the term's posting
list,right ? Cool.:) Read about relevance feedback from a textbook also
today,understand it well now,thanks :)
On Thu, Jan 10, 2013 at 3:42 PM, Olly Betts <olly at survex.com> wrote:
> On Thu, Jan 10, 2013 at 12:33:55PM +0530, aarsh shah wrote:
> > Hey guys hi.I am trying to understand how Xapian works .I read the
> > Theoretical Background to Xapian doc
> > and the report by Salton and Jones.I still cant seem to understand how
> > works
> When you do a search with the results sorted by relevance (the default),
> you give Xapian some terms (the query) and it gives you the document
> ids of the best documents according to a weighting formula which uses
> the terms you give it and how frequently they occur in each document and
> how many documents they occur in in total. The returned document ids
> are called the "match set" or "MSet" (though "set" is a bit misleading,
> as it is ordered).
> The process for forming the "ESet" is very similar, except that the
> roles of documents and terms are exchanged (and the mathematical
> formulae for the ESet are closely related to those for "TradWeight").
> So you pass in some document ids the user says are relevant (the "RSet",
> or "relevant set" - this one is unordered, so is actually a set in the
> mathematical sense) and using those documents and how frequently each
> term occurs in them compared to how frequently it occurs in the whole
> collection, Xapian computes the ESet.
> The ESet will tend to contain terms which occur more in the documents
> in the RSet than you would expect if terms were evenly spread through
> the collection.
> If you want to play around with it, you can see it on xapian.org -
> e.g. a search for "relevance" gives you a list of suggested expand
> terms in a green box:
> As you can see, it makes some good suggestions.
> These are produced by assuming the top 5 (IIRC) documents are relevant.
> You can select your own RSet by using the green checkboxes alongside
> each result - e.g. if I mark the first and third (which are API and
> internal docs for the RSet class) and click "Search" I get (URL
> simplified by hand to keep it short):
> You can see that the suggested terms are now more focussed on those
> related to the RSet API in particular.
> > How exactly does Xapian add terms to expand a query ?
> It's up to the developer using Xapian what they do with the terms in the
> ESet, but combining them with a query from the user is quite - in C++
> you can write:
> // Parse the query string from the user.
> Xapian::Query q = qp.parse_query(query_string);
> // OR together all the terms in the ESet.
> Xapian::Query eset_query(Xapian::Query::OP_OR, eset.begin(),
> // And OR together the two.
> q = Xapian::Query(Xapian::Query::OP_OR, q, eset_query);
> > Assuming we have a list of the k most important terms, how do we
> > decide which term to add to the query and will be in context with the
> > query ?
> I'm not sure quite what you're asking. The ESet is formed given an
> RSet, which is a set of document ids. That might come from searching
> using a query (as in the first xapian.org search above), but it doesn't
> have to. There's no direct connection between the query and the ESet
> > And to decide r in W(t)=rw(t), is r obtained from the user every time a
> > MSET is returned and is r for the term updated and stored for a term
> > time a query is performed ?
> No, r is "how many of the documents in the given RSet contain term t".
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Xapian-devel