[Xapian-discuss] Newbie problems with searching from Python

Mon Apr 27 18:26:24 BST 2009

Hi there,

I'm brand new to Xapian and trying to use it to add a full-text search
facility to my Web-published database using Python. I've developed an
API which gives me XML views of the database records I need to
index. I also have XSLT stylesheets which transform those XML views
into HTML for Web presentation.

What I'm trying to do is build a Xapian index of my HTML documents and
provide a simple keyword search interface to that index. Both the
indexing operation and the searching operation need to be called in
response to HTTP requests on an always-alive server (CherryPy, in
fact) and so a Python bindings-based solution is preferable to using
an external application (such as Omega) run in a separate process.

So far, I have the following:

import xapian
from lxml import etree

# some document 'value' (or metadata) constants
DOC_PATH = 0
DOC_RECORD_TYPE = 1
DOC_CATNO = 2
DOC_TITLE = 3
DOC_SUBTITLE = 4
DOC_YEAR = 5

def build_fulltext_index():
    # initialise the Xapian indexer
    database = xapian.WritableDatabase('indexes', xapian.DB_CREATE_OR_OPEN)
    indexer = xapian.TermGenerator()
    stemmer = xapian.Stem('english')
    indexer.set_stemmer(stemmer)

    for work in works_table.list_records():
        work_html = html_XSLT(work.xml())

        # create a Xapian document
        doc = xapian.Document()

        # set its properties to the properties of the work
        doc.add_value(DOC_PATH, '/works/%s' % work['catalogue_no'])
        doc.add_value(DOC_RECORD_TYPE, work_html.xpath('//meta[@name="record-type"]')[0].attrib['content'])
        doc.add_value(DOC_CATNO, work['catalogue_no'])
        doc.add_value(DOC_TITLE, work['title'])
        doc.add_value(DOC_SUBTITLE, work['subtitle'])
        doc.add_value(DOC_YEAR, work['year'])

        # set the HTML version of the work as the Xapian
        # document's data
        doc.set_data(etree.tostring(work_html))

        # index all the text inside the HTML <div class="work">
        # element
        indexer.set_document(doc)
        indexer.index_text('\n'.join(work_html.getroot().xpath('//div[@class="work"]//text()')))

        # add the document to the database
        if doc.get_docid() == 0:
            print '/works/%s has docid of 0' % work['catalogue_no']
        else:
            database.replace_document(doc.get_docid(), doc)

def search(terms, start, count):
    # load the index and initialise the query
    database = xapian.Database('indexes')
    enquire = xapian.Enquire(database)
    qp = xapian.QueryParser()
    stemmer = xapian.Stem('english')
    qp.set_stemmer(stemmer)
    qp.set_database(database)
    qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    query = qp.parse_query(terms)

    # execute the query
    enquire.set_query(query)
    matches = enquire.get_mset(start - 1, count)

    # iterate over the results
    for m in matches:
        # retrieve the document
        document = m.document

        # get the catalogue number and record-type of the hit
        cat_no = document.get_value(DOC_CATNO)
        record_type = document.get_value(DOC_RECORD_TYPE)

The above code seems to generate the index OK, and it also manages to
store and retrieve the metadata (titles, catalogue numbers, etc.).
However, I don't get anything like the number of hits I'd expect for
any given search. Generally, I get two or three hits for the search
terms I try, yet in every case I know that tens of records in my
database match the terms I'm testing. (I used to use Swish-e for my
indexing and it returned the kinds of results I'm expecting to see.)
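
One check that might help narrow this down is to look at the text that
actually reaches index_text() for a record. The sketch below is only an
illustration of that idea: it uses the standard library's
xml.etree.ElementTree instead of the lxml XPath call in the code above,
it assumes the XSLT output is well-formed XHTML, and the sample markup
and the extract_work_text name are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real ones come from the XSLT step.
SAMPLE_HTML = """<html><head><meta name="record-type" content="work"/></head>
<body><div class="nav">Home</div>
<div class="work"><h1>Symphony No. 1</h1><p>First performed in <b>1908</b>.</p></div>
</body></html>"""


def extract_work_text(html_source):
    """Return the text inside every <div class="work">, one chunk per line."""
    root = ET.fromstring(html_source)
    chunks = []
    for div in root.iter('div'):
        if div.get('class') == 'work':
            # itertext() walks the text and tail of every descendant in order
            chunks.extend(t.strip() for t in div.itertext() if t.strip())
    return '\n'.join(chunks)


text = extract_work_text(SAMPLE_HTML)
print(text)
```

If the printed text for a record that ought to match is missing the
words being searched for, the problem would be in the extraction step
rather than in the query. Note that this sketch would repeat text if one
<div class="work"> were nested inside another.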

So I don't really know what to ask. Does anyone know what I'm doing
wrong? Or is Xapian behaving as it should and I'm just expecting the
wrong thing of it?

Cheers,
Richard