[Xapian-discuss] related documents

Tim Brody tdb2 at ecs.soton.ac.uk
Tue Jul 27 13:05:42 BST 2010


On Tue, 2010-07-27 at 05:13 +0100, Olly Betts wrote:
> On Mon, Jul 26, 2010 at 05:05:16PM +0100, Tim Brody wrote:
> > I would like to take a doc in the xapian DB and find all related
> > documents by relevance e.g. so when you view one document it says
> > "Related entries X Y Z".
> > 
> > I'm aware of the "Morelikethis" Lucene plugin that is supposed to do
> > something like this, by generating a query from a document based on term
> > frequency.
> > 
> > Has anyone developed a tool to generate a query from a document?
> > Is there a short-cut one can make with RSets?
> 
> Omega's MORELIKE feature is implemented like so:
> 
>     Xapian::RSet tmprset;
>     tmprset.add_document(docid);
>     OmegaExpandDecider decider(db);
>     Xapian::ESet eset(enquire->get_eset(40, tmprset, &decider));
>     for (Xapian::ESetIterator i = eset.begin(); i != eset.end(); ++i) {
>         // Handle term *i
>     }
> 
> If you want a query object, then you can just do:
> 
>     Xapian::RSet tmprset;
>     tmprset.add_document(docid);
>     OmegaExpandDecider decider(db);
>     Xapian::ESet eset(enquire->get_eset(40, tmprset, &decider));
>     Xapian::Query query(Xapian::Query::OP_OR, eset.begin(), eset.end());
> 
> This picks up to 40 terms, favouring those which are relatively more common
> in the document than in the collection in general.
> 
> The OmegaExpandDecider class filters the terms you are interested in - you
> can find that here:
> 
> http://trac.xapian.org/browser/trunk/xapian-applications/omega/query.h#L37
> http://trac.xapian.org/browser/trunk/xapian-applications/omega/query.cc#L2242

Thanks Olly (5:30am?!).

I've attached a patch to add ExpandDecider support to
Enquire::get_eset() and a patch to fix some missing files in the
RPM .spec file.

Does this look correct (in Perl)?

    my $rset = Search::Xapian::RSet->new();
    $rset->add_document( $docid );

    my $enq = Search::Xapian::Enquire->new( $db );

    my $stopper = Search::Xapian::SimpleStopper->new();
    # ... add some stop words
    my $eset = $enq->get_eset( 40, $rset, sub {
        my( $term ) = @_;

        # Reject terms with a prefix
        return 0 if $term =~ /^[A-Z]/;

        # Don't suggest stopwords
        return 0 if $stopper->stop_word( $term );

        # Reject small numbers
        return 0 if $term =~ /^[0-9]{1,3}$/;

        # Reject terms containing a space
        return 0 if $term =~ /\s/;

        # Ignore terms that occur in at most one document
        # (get_termfreq() counts documents, not occurrences)
        return 0 if $db->get_termfreq( $term ) <= 1;

        # Ignore any terms used in the original query
        return 0 if grep { $term eq $_ } @query_terms;

        return 1;
    } );

    my @terms = map { $_->get_termname() } $eset->items;

    $enq = Search::Xapian::Enquire->new( $db );
    $enq->set_query(
            Search::Xapian::Query->new(
                Search::Xapian::OP_OR(),
                @terms
            ),
    );
    my @docs = $enq->get_mset( 0, 10, $rset )->items;
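
For comparison with the C++ quoted earlier, the rejection rules in the callback above could be written as a plain C++ predicate along the following lines. This is only a sketch and does not use Xapian at all: accept_term and its parameters are hypothetical names, the term frequency is passed in rather than looked up with get_termfreq(), and a std::set stands in for the stopper and for the list of original query terms.

```cpp
#include <cctype>
#include <set>
#include <string>

// Hypothetical helper, not part of Xapian: applies the same rejection rules
// as the Perl callback to a single candidate expansion term.
//   termfreq    - number of documents indexed by the term (the value
//                 get_termfreq() would return), passed in so that the
//                 sketch needs no database
//   stopwords   - stand-in for the SimpleStopper
//   query_terms - terms used in the original query
bool accept_term(const std::string& term,
                 unsigned termfreq,
                 const std::set<std::string>& stopwords,
                 const std::set<std::string>& query_terms)
{
    // Assumption of this sketch: an empty term is never useful.
    if (term.empty()) return false;

    // Reject terms with a prefix (leading ASCII capital letter).
    if (term[0] >= 'A' && term[0] <= 'Z') return false;

    // Don't suggest stopwords.
    if (stopwords.count(term)) return false;

    // Reject terms containing whitespace, and note whether the term
    // consists of digits only.
    bool all_digits = true;
    for (unsigned char ch : term) {
        if (std::isspace(ch)) return false;
        if (!std::isdigit(ch)) all_digits = false;
    }

    // Reject small numbers (one to three digits).
    if (all_digits && term.size() <= 3) return false;

    // Ignore terms that occur in at most one document.
    if (termfreq <= 1) return false;

    // Ignore any terms used in the original query.
    if (query_terms.count(term)) return false;

    return true;
}
```

Inside a real Xapian::ExpandDecider subclass, operator() could delegate to a function like this, with the frequency obtained from the database.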


/Tim.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: eset_patch.diff
Type: text/x-patch
Size: 2889 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-discuss/attachments/20100727/75f42c22/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spec_metadata_cmake_patch.diff
Type: text/x-patch
Size: 917 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-discuss/attachments/20100727/75f42c22/attachment-0001.bin>

