[Xapian-discuss] related documents
Tim Brody
tdb2 at ecs.soton.ac.uk
Tue Jul 27 13:05:42 BST 2010
On Tue, 2010-07-27 at 05:13 +0100, Olly Betts wrote:
> On Mon, Jul 26, 2010 at 05:05:16PM +0100, Tim Brody wrote:
> > I would like to take a doc in the xapian DB and find all related
> > documents by relevance e.g. so when you view one document it says
> > "Related entries X Y Z".
> >
> > I'm aware of the "Morelikethis" Lucene plugin that is supposed to do
> > something like this, by generating a query from a document based on term
> > frequency.
> >
> > Has anyone developed a tool to generate a query from a document?
> > Is there a short-cut one can make with RSets?
>
> Omega's MORELIKE feature is implemented like so:
>
> Xapian::RSet tmprset;
> tmprset.add_document(docid);
> OmegaExpandDecider decider(db);
> Xapian::ESet eset(enquire->get_eset(40, tmprset, &decider));
> for (Xapian::ESetIterator i = eset.begin(); i != eset.end(); ++i) {
>     // Handle term *i
> }
>
> If you want a query object, then you can just do:
>
> Xapian::RSet tmprset;
> tmprset.add_document(docid);
> OmegaExpandDecider decider(db);
> Xapian::ESet eset(enquire->get_eset(40, tmprset, &decider));
> Xapian::Query query(Xapian::Query::OP_OR, eset.begin(), eset.end());
>
> This picks up to 40 terms, favouring those which are relatively more common
> in the document than in the collection in general.
>
> The OmegaExpandDecider class filters the terms you are interested in - you
> can find that here:
>
> http://trac.xapian.org/browser/trunk/xapian-applications/omega/query.h#L37
> http://trac.xapian.org/browser/trunk/xapian-applications/omega/query.cc#L2242
Thanks Olly (5:30am?!).

I've attached a patch to add ExpandDecider support to
Enquire::get_eset() and a patch to fix some missing files in the
RPM .spec file.

Does this look correct (in Perl)?

my $rset = Search::Xapian::RSet->new();
$rset->add_document( $docid );
my $enq = Search::Xapian::Enquire->new( $db );
my $stopper = Search::Xapian::SimpleStopper->new();
# ... add some stop words
my $eset = $enq->get_eset( 40, $rset, sub {
    my( $term ) = @_;
    # Reject terms with a prefix
    return 0 if $term =~ /^[A-Z]/;
    # Don't suggest stopwords
    return 0 if $stopper->stop_word( $term );
    # Reject small numbers
    return 0 if $term =~ /^[0-9]{1,3}$/;
    # Reject terms containing a space
    return 0 if $term =~ /\s/;
    # Ignore terms that only occur once
    return 0 if $db->get_termfreq( $term ) <= 1;
    # Ignore any terms used in the original query
    return 0 if grep { $term eq $_ } @query_terms;
    return 1;
} );
my @terms = map { $_->get_termname() } $eset->items;
# Build an OR query from the expand terms and run it against the same database
$enq = Search::Xapian::Enquire->new( $db );
$enq->set_query(
    Search::Xapian::Query->new(
        Search::Xapian::OP_OR(),
        @terms
    ),
);
my @docs = $enq->get_mset( 0, 10, $rset )->items;
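
For comparison, the filtering rules in the callback above can be written as a plain C++ predicate with no Xapian dependency, which makes them easy to check in isolation. This is only a sketch: TermFilter, accept_term() and the helper functions are hypothetical names, the two std::set members stand in for the SimpleStopper and @query_terms, and termfreq is assumed to be the value get_termfreq() would return for the term.

```cpp
#include <cctype>
#include <set>
#include <string>

// Hypothetical stand-in for the state the Perl callback closes over.
struct TermFilter {
    std::set<std::string> stopwords;    // plays the role of the SimpleStopper
    std::set<std::string> query_terms;  // plays the role of @query_terms
};

// True if the term consists of one to three ASCII digits, like /^[0-9]{1,3}$/.
static bool is_small_number(const std::string& term) {
    if (term.empty() || term.size() > 3) return false;
    for (unsigned char c : term) {
        if (!std::isdigit(c)) return false;
    }
    return true;
}

// True if the term contains any whitespace character, like /\s/.
static bool has_whitespace(const std::string& term) {
    for (unsigned char c : term) {
        if (std::isspace(c)) return true;
    }
    return false;
}

// Returns true if the term should be kept as an expansion term.
// termfreq is assumed to be the number of documents containing the term.
bool accept_term(const TermFilter& f, const std::string& term,
                 unsigned termfreq) {
    // Reject terms with a prefix (leading ASCII capital letter).
    if (!term.empty() && term[0] >= 'A' && term[0] <= 'Z') return false;
    // Don't suggest stopwords.
    if (f.stopwords.count(term) != 0) return false;
    // Reject small numbers.
    if (is_small_number(term)) return false;
    // Reject terms containing whitespace.
    if (has_whitespace(term)) return false;
    // Ignore terms that only occur once.
    if (termfreq <= 1) return false;
    // Ignore any terms used in the original query.
    if (f.query_terms.count(term) != 0) return false;
    return true;
}
```

In a real ExpandDecider subclass the same checks would presumably sit in operator(), with the term frequency looked up from the database rather than passed in.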
/Tim.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: eset_patch.diff
Type: text/x-patch
Size: 2889 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-discuss/attachments/20100727/75f42c22/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spec_metadata_cmake_patch.diff
Type: text/x-patch
Size: 917 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-discuss/attachments/20100727/75f42c22/attachment-0001.bin>