[Xapian-discuss] PHP termpos issue

Olly Betts olly at survex.com
Sun Sep 2 20:40:09 BST 2007


On Sat, Aug 25, 2007 at 05:33:43PM -0400, Mike Ragalie wrote:
> I have a set of documents and I'm using Xapian to retrieve relevant ones
> based on search terms, and I thought it would be useful to display the text
> surrounding the terms found in each document on my results page (in short,
> display context).

That is a nice feature.  As I've said before, I think Xapian should
provide support for it in some way, but it doesn't currently.

> What I was going to do was collect all of the position
> information, and for queries including more than one term and for each
> document which matches more than one of those terms, output maybe five words
> on either side of the terms on my results page, using the instances of the
> different terms which are closest together geographically in the document as
> the ones to display

That makes sense.  Perhaps it's worthwhile favouring sentence boundaries
(i.e. output fewer than five words if you find a boundary first, and maybe
output an extra word or two if that gets you to one).
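
A rough sketch of that idea, in Python rather than PHP and without touching
the Xapian API at all.  It assumes you already have the document as a list of
words plus, for each of two query terms, a sorted, non-empty list of 0-based
word positions (positions read from a database may well be numbered
differently, depending on how the text was indexed).  The names
closest_pair(), snippet(), width and slack are made up for illustration, and
"ends in . ! or ?" is only a crude stand-in for real sentence detection:

```python
import re

# Crude assumption: a word ending in . ! or ? closes a sentence.
SENTENCE_END = re.compile(r"[.!?]$")


def closest_pair(pos_a, pos_b):
    """Return (a, b) with the smallest |a - b| from two sorted, non-empty lists."""
    best = None
    i = j = 0
    while i < len(pos_a) and j < len(pos_b):
        a, b = pos_a[i], pos_b[j]
        if best is None or abs(a - b) < abs(best[0] - best[1]):
            best = (a, b)
        # Advance whichever list is behind; that is the only move that can
        # bring the two positions closer together.
        if a < b:
            i += 1
        else:
            j += 1
    return best


def snippet(words, pos_a, pos_b, width=5, slack=2):
    """Show `width` words either side of the closest pair of term positions,
    cutting short at a sentence boundary, or stretching by up to `slack`
    words if that reaches one."""
    a, b = closest_pair(pos_a, pos_b)
    lo, hi = min(a, b), max(a, b)
    ends = [i for i, w in enumerate(words) if SENTENCE_END.search(w)]

    # Left edge: the sentence containing `lo` starts just after the last
    # sentence end before it (or at the start of the document).
    start = max(0, lo - width)
    sentence_start = ([0] + [e + 1 for e in ends if e < lo])[-1]
    if sentence_start >= start - slack:
        start = sentence_start

    # Right edge: the first sentence end at or after `hi` (or the last word).
    end = min(len(words) - 1, hi + width)
    sentence_end = ([e for e in ends if e >= hi] + [len(words) - 1])[0]
    if sentence_end <= end + slack:
        end = sentence_end

    return " ".join(words[start:end + 1])


text = ("Xapian is a search library. It stores term positions so phrase "
        "queries work well on large collections of text. The end.")
# "term" is word 7 and "phrase" is word 10 in this text.
print(snippet(text.split(), [7], [10]))
```

Here the left edge is pulled in to the start of the sentence, while the right
edge stays at five words because the next full stop is more than `slack`
words further on.  With more than two query terms you would need to
generalise closest_pair(), which this sketch doesn't attempt.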

> Now, I'm thinking I might split the documents by sentence (or perhaps
> clause, if I can figure out a reliable method of doing so)

I've heard it can be tricky to reliably identify even sentence boundaries
for arbitrary text on the web!

> I'll be interested to see how Xapian performs with very
> small documents (e.g. 3-5 terms each), as my true documents have
> non-sentence elements which will probably end up getting split into pretty
> small pieces by the parser I write.

If the weighting isn't good, you might find tweaking the weighting
scheme parameters helps.
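
To give a feel for what such tweaking does: Xapian's default scheme is BM25,
and its b parameter controls how strongly document length affects the
weight.  The sketch below uses the textbook form of the BM25 term-frequency
component, which is not necessarily identical to what Xapian computes
internally; bm25_tf() and the numbers fed to it are purely illustrative:

```python
def bm25_tf(tf, doclen, avgdoclen, k1=1.0, b=0.5):
    """Textbook BM25 term-frequency component (an illustration only).

    b = 0 ignores document length entirely; b = 1 normalises fully
    by doclen / avgdoclen.
    """
    norm_len = doclen / avgdoclen
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * norm_len))


# A 4-term fragment and a 100-term document, each containing the term once,
# in a collection whose average length is assumed to be 100 terms.
for b in (0.0, 0.5, 1.0):
    short = bm25_tf(1, 4, 100, b=b)
    long_ = bm25_tf(1, 100, 100, b=b)
    print(f"b={b}: short={short:.3f} long={long_:.3f}")
```

With b = 0 the two documents score the same for a single occurrence; as b
rises, the very short fragment is increasingly favoured.  If tiny fragments
crowd out better matches, a lower b is one thing to try.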

If you get something which works well, I'd certainly be interested to
consider it for integrating as a standard feature.

> One thing I was wondering is whether the MatchDecider and
> ExpandDecider frameworks (I don't know any of the terminology; I just mess
> around with this sort of stuff) are implemented in the PHP bindings. It
> didn't look like Enquire::get_mset() took a MatchDecider object, and I
> couldn't find a definition for a XapianMatchDecider in the include file.

SWIG doesn't currently support "directors" for PHP, so we can't allow
PHP subclasses of wrapped C++ classes to be passed to C++ methods.
This means that MatchDecider and ExpandDecider aren't wrapped for PHP
(in SVN, there are some standard C++ MatchDecider subclasses, but you
still can't create your own PHP subclass).

> Same thing with getting an eset. If these aren't implemented, I guess I'll
> just write something in C++ and use exec() to call it for the eset at least,
> since I can just have that return a string of terms.

Depends if you need to filter the terms - if not, you don't need an
ExpandDecider.  Another approach would be to include some standard
ExpandDecider subclasses (e.g. return only terms with a given prefix)
which we could then wrap for PHP, etc.  That would avoid a call from C++
to the scripting language, so would be useful for other languages too.
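
For illustration, the logic of such a prefix-based decider is tiny.  This is
a Python sketch of the idea only; PrefixDecider and filter_terms() are
hypothetical names, not classes or functions in Xapian or its bindings, and
the term list stands in for whatever candidate terms you have:

```python
class PrefixDecider:
    """Hypothetical decider: accept only terms starting with `prefix`."""

    def __init__(self, prefix):
        self.prefix = prefix

    def __call__(self, term):
        # A decider is just a yes/no test applied to each candidate term.
        return term.startswith(self.prefix)


def filter_terms(terms, decider, limit):
    """Keep the first `limit` terms the decider accepts, preserving order."""
    kept = [t for t in terms if decider(t)]
    return kept[:limit]


# Made-up candidate terms, assumed to be in descending weight order.
candidates = ["Kphp", "search", "Kxapian", "Aolly"]
print(filter_terms(candidates, PrefixDecider("K"), 10))
```

Because the test is fixed and needs no callback into the scripting language,
logic like this could live entirely on the C++ side.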

> But is there a list of which bindings support which parts of the API?

Any unwrapped functionality should be covered in the appropriate
bindings.html file.  If you find omissions, please let us know.

Cheers,
    Olly


