[Xapian-discuss] PHP termpos issue

Sat Aug 25 22:33:43 BST 2007

Hi Olly,

I have a set of documents and I'm using Xapian to retrieve relevant ones
based on search terms, and I thought it would be useful to display the text
surrounding the terms found in each document on my results page (in short,
display context). What I was going to do was collect all of the position
information, and for queries including more than one term and for each
document which matches more than one of those terms, output maybe five words
on either side of the terms on my results page, using the instances of the
different terms which are closest together in the document as
the ones to display (i.e. if the two terms searched were "Zdivest" and
"sudan," and the positions were 25 and 100, and 28 and 60, respectively, I
would output the text from position 20 to position 33 on the results page).
Then for one-term queries or for documents which matched only one of the
terms, I would just output the first instance. (I kept the actual full text
of the document, by "term," in an RDBMS with matching positional identifiers,
so even though the term match might be "Zdivest" the RDBMS entry would read
"divestment".) Since most queries use the stemmed forms, I needed to have
that positional information for this to work.
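
A rough sketch of that closest-pair idea, in Python rather than PHP purely
for brevity. The function name context_window and its parameters are
made up for illustration, and the position lists are assumed to have been
read out of the index already; nothing here calls Xapian itself.

```python
from itertools import product


def context_window(positions_a, positions_b, context=5):
    """Return (start, end) word positions around the closest pair of hits."""
    # Brute-force every pairing; per-document position lists are short.
    a, b = min(product(positions_a, positions_b),
               key=lambda pair: abs(pair[0] - pair[1]))
    lo, hi = min(a, b), max(a, b)
    # Clamping at 0 is an assumption so the window never goes negative.
    return max(lo - context, 0), hi + context


# The example from the text: "Zdivest" at 25 and 100, "sudan" at 28 and 60.
print(context_window([25, 100], [28, 60]))
```

With the positions from the example, the closest pair is 25 and 28, so the
window runs from position 20 to position 33. A query with more than two
terms would need a different search over the position lists.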

Now, I'm thinking I might split the documents by sentence (or perhaps
clause, if I can figure out a reliable method of doing so), and then insert
these sentences into a separate Xapian database as documents, then run the
same query terms on this sentence database to find the most relevant
sentences in the true document. Then I could just measure the length of the
results and return, say, the most relevant 200 characters' worth of content,
in order. I feel like this would likely be more useful, plus I've been
impressed by the speed of Xapian, so I'm not too concerned about the
multiple queries. I'll be interested to see how Xapian performs with very
small documents (e.g. 3-5 terms each), as my true documents have
non-sentence elements which will probably end up getting split into pretty
small pieces by the parser I write.
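
Roughly what the sentence approach could look like, again sketched in
Python. The helper names split_sentences and build_snippet are
hypothetical, the splitter is deliberately naive, and ranked_indices
stands in for whatever ordering a query against the sentence database
would return; that query is not shown.

```python
import re


def split_sentences(text):
    """Naively split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def build_snippet(sentences, ranked_indices, budget=200):
    """Take sentences best-first until the budget is spent, then
    return them joined in their original document order."""
    chosen = []
    used = 0
    for idx in ranked_indices:
        # Count one joining space for every sentence after the first.
        cost = len(sentences[idx]) + (1 if chosen else 0)
        if used + cost > budget:
            continue  # too long; a shorter, lower-ranked one may still fit
        chosen.append(idx)
        used += cost
    return " ".join(sentences[i] for i in sorted(chosen))


sentences = split_sentences("Alpha beta. Gamma delta! Epsilon zeta? Eta.")
# Pretend the sentence-level query ranked sentence 2 first, then 0, 1, 3.
print(build_snippet(sentences, [2, 0, 1, 3], budget=25))
```

Skipping a sentence that doesn't fit, rather than stopping, lets a shorter
lower-ranked sentence fill the remaining space; stopping at the first miss
would be the stricter alternative.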

In any event, this is probably more information than anyone will ever care
to know :) One thing I was wondering is whether the MatchDecider and
ExpandDecider frameworks (I don't know any of the terminology; I just mess
around with this sort of stuff) are implemented in the PHP bindings. It
didn't look like Enquire::get_mset() took a MatchDecider object, and I
couldn't find a definition for XapianMatchDecider in the include file.
Same thing with getting an eset. If these aren't implemented, I guess I'll
just write something in C++ and use exec() to call it for the eset at least,
since I can just have that return a string of terms. But is there a list of
which bindings support which parts of the API? That would be useful.

Thanks for all of your help,
-Mike

On 8/25/07, Olly Betts <olly at survex.com> wrote:
>
> On Thu, Aug 23, 2007 at 02:40:38PM -0400, Mike Ragalie wrote:
> > The latter formulation is working for me now; I didn't realize that the
> > TermGenerator doesn't attach position information to stemmed terms.
>
> The reasoning behind this is that it helps keep the database size down.
> Phrases most naturally work on unstemmed terms and it's reasonable
> enough that NEAR, etc. do too.
>
> Why are you after positional information for the stemmed terms?  It could
> perhaps be made an option for TermGenerator.
>
> Cheers,
>     Olly
>
>