[Xapian-discuss] Re: searching and sorting by date

Sat Mar 25 14:43:41 GMT 2006

On Fri, Mar 24, 2006 at 12:13:54PM -0800, Michel Pelletier wrote:

> >Querying is more complex than it needs to be. We could easily add
> >something into the bindings so you could do:
> >
> >----------------------------------------------------------------------
> >for match in database.query("my search terms"):
> >    # do something
> >    pass
> >----------------------------------------------------------------------
> 
> I think that would be very useful.

I've attached a patch against svn HEAD (it should work against 0.9.4
as well, I think), which also adds documentation for the pythonic
iterators.

[prefix mapping]
> Xapwrap manages those mappings for you.  That's one of the really nice 
> things it does out of the box.  When a document is indexed, the keys of 
> "Keywords" are remembered in a dictionary and the query parser is 
> automatically configured with the appropriate prefixes.  You can either 
> save/restore the mapping from a dictionary (which I use), or Xapwrap has 
> support for storing its metadata in document==1 in the xapian database.

Neat.

> By default in Xapwrap it just prints the score and document id.  You can 
> access values of the document from the result, so if title was an 
> existing value:
> 
>     print result['values']['title']
> 
> would print the title of the matching document.

I assume that "value" in Xapwrap corresponds to "data field" in
Xapian. This was a little confusing a while ago, when we renamed half
the concepts around documents - fields are strings with arbitrary
names, while values are integer-numbered strings. (Values are of very
specific use; you tend to use fields when displaying stuff about a
document.)

By default, if you do:

----------------------------------------------------------------------
for match in database.query("query string here", language="en"):
    print match
----------------------------------------------------------------------

with the work I've done (both on pythonic iterators and, yesterday, on
adding the query() and enquire() methods to the Database object) you'll
get something slightly unhelpful:

----------------------------------------------------------------------
[12513, 0.0, 0, 100, <xapian.Document; proxy of <Swig Object of type
'Xapian::Document *' at 0x8199540> >]
----------------------------------------------------------------------

It wouldn't be difficult to make the pythonic iterators return
appropriate objects rather than lists, but I don't know how far we
want to go here. In order, the list above gives docid, weight, rank,
percentage and then the Document object itself - thoughts on whether
this is the right approach welcome. If you wanted to print the title,
you'd need quite a bit more work (as Xapian itself doesn't define what
the Document data is used for, and the bindings don't do anything
automatically). You'd end up with something like:

----------------------------------------------------------------------
# given result is an element of the MSet
print (filter(lambda x: x.split("=")[0]=="Title", \
      result[4].get_data().split("\n")))[0].split("=")[1]
----------------------------------------------------------------------

Really not recommended! However given that omega provides such a
convenient test query interface, and scriptindex provides such a
convenient way of building databases, I wonder if it might not be
worth having some extensions to the python bindings to work with that
view of data. It wouldn't be difficult to add methods to allow:

----------------------------------------------------------------------
print result[4]['Title']
----------------------------------------------------------------------

I don't want to implement this without further discussion on the list,
however - it might be better done in core, or better left to a wrapper.
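
If it were left to a wrapper, the parsing side might look roughly like
the following. This is only a sketch: parse_fields is a made-up name,
not something in the bindings, and it assumes the document data is laid
out as one name=value pair per line, as the one-liner above does. The
data string is a hand-written stand-in for result[4].get_data().

```python
def parse_fields(data):
    """Turn 'name=value' lines into a dict (hypothetical helper, not in the bindings)."""
    fields = {}
    for line in data.split("\n"):
        if "=" not in line:
            continue  # ignore anything that isn't name=value
        name, value = line.split("=", 1)  # split once, so values may contain '='
        fields[name] = value
    return fields


# Stand-in for result[4].get_data() on a database laid out this way.
data = "url=/docs/intro.html\nTitle=Searching and sorting by date\nsample=a=b"
fields = parse_fields(data)
print(fields["Title"])
```

Splitting only on the first "=" keeps values that themselves contain
"=" intact, which the one-liner above would truncate.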

> While the term generation explanation makes sense once it's 
> explained, it's a tough concept for a new user to jump over right away.

That's why we tend to encourage most people to use scriptindex, which
takes care of things for you. If Xapwrap is doing the same term
generation (and it sounds like it is) then that's a neat solution for
people who want a little more without digging really deep.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org
-------------- next part --------------
Index: python/docs/bindings.html
===================================================================

--- python/docs/bindings.html	(revision 6674)
+++ python/docs/bindings.html	(working copy)
@@ -68,6 +68,44 @@
    Other methods, such as <code>MSetIterator.get_document()</code>, are
    available unchanged.
 </p>
+
+<h2>Pythonic iterators</h2>
+
+<p>
+Many classes that support C++-style iterators also support Pythonic
+iterators which do the same thing in a Python style. The following are
+supported (where marked as default iterator, it means __iter__() does the right thing so you can for instance use <code>for term in document</code> to iterate over terms in the Document):
+</p>
+
+<table title='Python iterators'>
+<thead><tr><td>Class</td><td>Method</td><td>Equivalent to</td><td>Iterator type</td></tr></thead>
+<tr><td><code>MSet</code></td><td>default iterator</td><td><code>begin()</code></td><td><code>MSetIter</code></td></tr>
+<tr><td><code>ESet</code></td><td>default iterator</td><td><code>begin()</code></td><td><code>ESetIter</code></td></tr>
+<tr><td><code>Enquire</code></td><td><code>matching_terms()</code></td><td><code>get_matching_terms_begin()</code></td><td><code>TermIter</code></td></tr>
+<tr><td><code>Query</code></td><td>default iterator</td><td><code>get_terms_begin()</code></td><td><code>TermIter</code></td></tr>
+<tr><td><code>Database</code></td><td><code>allterms()</code></td><td><code>allterms_begin()</code> (also as default iterator)</td><td><code>TermIter</code></td></tr>
+<tr><td><code>Database</code></td><td><code>postlist(tname)</code></td><td><code>postlist_begin(tname)</code></td><td><code>PostingIter</code></td></tr>
+<tr><td><code>Database</code></td><td><code>termlist(docid)</code></td><td><code>termlist_begin(docid)</code></td><td><code>TermIter</code></td></tr>
+<tr><td><code>Database</code></td><td><code>positionlist(docid, tname)</code></td><td><code>positionlist_begin(docid, tname)</code></td><td><code>PositionIter</code></td></tr>
+<tr><td><code>Document</code></td><td><code>values()</code></td><td><code>values_begin()</code></td><td><code>ValueIter</code></td></tr>
+<tr><td><code>Document</code></td><td><code>termlist()</code></td><td><code>termlist_begin()</code> (also as default iterator)</td><td><code>TermIter</code></td></tr>
+<tr><td><code>QueryParser</code></td><td><code>stoplist()</code></td><td><code>stoplist_begin()</code></td><td><code>TermIter</code></td></tr>
+<tr><td><code>QueryParser</code></td><td><code>unstemlist(tname)</code></td><td><code>unstem_begin(tname)</code></td><td><code>TermIter</code></td></tr>
+</table>
+
+<p>
+The Pythonic iterators will all return lists representing the appropriate item when their <code>next()</code> method is called, except PositionIter which just returns a single value:
+</p>
+
+<table>
+<thead><tr><td>Class</td><td>Returns</td></tr></thead>
+<tr><td><code>MSetIter</code></td><td>[docid, weight, rank, percentage, document]</td></tr>
+<tr><td><code>ESetIter</code></td><td>[termname, weight]</td></tr>
+<tr><td><code>TermIter</code></td><td>[term, wdf, termfreq, position iterator]</td></tr>
+<tr><td><code>PostingIter</code></td><td>[docid, doclength, wdf, position iterator]</td></tr>
+<tr><td><code>PositionIter</code></td><td>termpos</td></tr>
+<tr><td><code>ValueIter</code></td><td>[valueno, value]</td></tr>
+</table>
    
 <h2>MSet</h2>
 
@@ -112,6 +150,34 @@
 <tr><td><code>xapian.ESET_WT</code></td><td>Weight</td></tr>
 </table>
 
+<h2>Database query functions</h2>
+
+<p>
+The Database has two methods, <code>enquire()</code> and <code>query()</code>. The first builds an Enquire object for the query; the second builds an Enquire object and then returns the MSet from it. They accept the following parameters, all optional except <code>querystring</code>:
+</p>
+
+<dl>
+<dt><code>querystring</code></dt>
+<dd>The textual query to be parsed</dd>
+
+<dt><code>first</code></dt>
+<dd>First item in the result set to return, 0 being the first item and the default [<code>query()</code> only]</dd>
+
+<dt><code>maxitems</code></dt>
+<dd>Maximum number of items to return, defaulting to 10 [<code>query()</code> only]</dd>
+
+<dt><code>flags</code></dt>
+<dd>Flags to pass to <code>QueryParser::parse_query()</code>, defaulting to 0 (no flags)</dd>
+
+<dt><code>language</code></dt>
+<dd>Language to use for stemming, defaulting to None (no stemming)</dd>
+
+<dt><code>strategy</code></dt>
+<dd>Stemming strategy to use, defaulting to STEM_SOME if <code>language</code> is set and STEM_NONE otherwise</dd>
+
+<dt><code>queryparser</code></dt>
+<dd><code>QueryParser</code> object to use - if you set this, <code>language</code> and <code>strategy</code> are ignored</dd>
+</dl>
 <h2>Database Factory Functions</h2>
 
 <ul>
Index: python/extra.i
===================================================================
--- python/extra.i	(revision 6674)
+++ python/extra.i	(working copy)
@@ -1,7 +1,7 @@
 %{
 /* python/extra.i: Xapian scripting python interface additional code.
  *
- * Copyright (C) 2003,2004,2005 James Aylett
+ * Copyright (C) 2003,2004,2005,2006 James Aylett
  * Copyright (C) 2005,2006 Olly Betts
  *
  * This program is free software; you can redistribute it and/or
@@ -170,6 +170,29 @@
 Database.termlist = database_gen_termlist_iter
 Database.positionlist = database_gen_positionlist_iter
 
+def database_enquire(self, querystring, flags=0, language=None, strategy=None, queryparser=None):
+    enq = Enquire(self)
+    if queryparser is None:
+        queryparser = QueryParser()
+        if language is not None:
+            queryparser.set_stemmer(Stem(language))
+            if strategy is None:
+                strategy = QueryParser.STEM_SOME
+        if strategy is None:
+            queryparser.set_stemming_strategy(QueryParser.STEM_NONE)
+        else:
+            queryparser.set_stemming_strategy(strategy)
+    q = queryparser.parse_query(querystring, flags)
+    enq.set_query(q)
+    return enq
+
+def database_query(self, querystring, first=0, maxitems=10, flags=0, language=None, strategy=None, queryparser=None):
+    enq = self.enquire(querystring, flags, language, strategy, queryparser)
+    return enq.get_mset(first, maxitems)
+
+Database.enquire = database_enquire
+Database.query = database_query
+
 def document_gen_termlist_iter(self):
     return TermIter(self.termlist_begin(), self.termlist_end(), TermIter.HAS_POSITIONS)
 def document_gen_values_iter(self):