[Xapian-discuss] Newbie problems with searching from Python

Richard Lewis richardlewis at fastmail.co.uk
Mon Apr 27 18:26:24 BST 2009


Hi there,

I'm brand new to Xapian and trying to use it to add a full-text search
facility to my Web-published database using Python. I've developed an
API which gives me XML views of the database records I need to
index. I also have XSLT stylesheets which transform those XML views
into HTML for Web presentation.

What I'm trying to do is build a Xapian index of my HTML documents and
provide a simple keyword search interface to that index. Both the
indexing operation and the searching operation need to be called in
response to HTTP requests on an always-alive server (CherryPy, in
fact) and so a Python bindings-based solution is preferable to using
an external application (such as Omega) run in a separate process.

So far, I have the following:

import xapian
from lxml import etree  # assumed: the .xpath() and etree.tostring() calls below suggest lxml

# some document 'value' (or metadata) constants
DOC_PATH = 0
DOC_RECORD_TYPE = 1
DOC_CATNO = 2
DOC_TITLE = 3
DOC_SUBTITLE = 4
DOC_YEAR = 5

def build_fulltext_index():
    # initialise the Xapian indexer
    database = xapian.WritableDatabase('indexes', xapian.DB_CREATE_OR_OPEN)
    indexer = xapian.TermGenerator()
    stemmer = xapian.Stem('english')
    indexer.set_stemmer(stemmer)

    for work in works_table.list_records():
        work_html = html_XSLT(work.xml())

        # create a Xapian document
        doc = xapian.Document()

        # set its properties to the properties of the work
        doc.add_value(DOC_PATH, '/works/%s' % work['catalogue_no'])
        doc.add_value(DOC_RECORD_TYPE, work_html.xpath('//meta[@name="record-type"]')[0].attrib['content'])
        doc.add_value(DOC_CATNO, work['catalogue_no'])
        doc.add_value(DOC_TITLE, work['title'])
        doc.add_value(DOC_SUBTITLE, work['subtitle'])
        doc.add_value(DOC_YEAR, work['year'])

        # set the HTML version of the work as the Xapian
        # document's data
        doc.set_data(etree.tostring(work_html))

        # index all the text inside the HTML <div class="work">
        # element
        indexer.set_document(doc)
        indexer.index_text('\n'.join(work_html.getroot().xpath('//div[@class="work"]//text()')))

        # add the document to the database
        if doc.get_docid() == 0:
            print '/works/%s has docid of 0' % work['catalogue_no']
        else:
            database.replace_document(doc.get_docid(), doc)

def search(terms, start, count):
    # load the index and initialise the query
    database = xapian.Database('indexes')
    enquire = xapian.Enquire(database)
    qp = xapian.QueryParser()
    stemmer = xapian.Stem('english')
    qp.set_stemmer(stemmer)
    qp.set_database(database)
    qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    query = qp.parse_query(terms)

    # execute the query
    enquire.set_query(query)
    matches = enquire.get_mset(start - 1, count)

    # iterate over the results
    for m in matches:
        # retrieve the document
        document = m.document

        # get the catalogue number and record-type of the hit
        (cat_no, record_type) = (document.get_value(DOC_CATNO), document.get_value(DOC_RECORD_TYPE))
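
The text-extraction step in build_fulltext_index() can be checked on its own, without Xapian, to see how much text actually reaches indexer.index_text(). The following is only a sketch under some assumptions: it uses the standard library's xml.etree.ElementTree rather than the XPath calls above, it assumes the XSLT output is well-formed and carries no XML namespace, and SAMPLE_HTML and extract_work_text() are hypothetical names invented for illustration. It also strips whitespace-only text nodes, which the original join does not.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the output of html_XSLT(work.xml()).
SAMPLE_HTML = """<html>
  <head><meta name="record-type" content="work"/></head>
  <body>
    <div class="work">
      <h1>Symphony No. 1</h1>
      <p>First performed in <span>1901</span>.</p>
    </div>
    <div class="footer">Not indexed</div>
  </body>
</html>"""


def extract_work_text(html_source):
    """Return the text inside <div class="work">, one text node per line."""
    root = ET.fromstring(html_source)
    chunks = []
    for div in root.iter("div"):
        if div.get("class") == "work":
            # itertext() yields every descendant text node, much like
            # //text() does in the XPath expression used when indexing.
            chunks.extend(t.strip() for t in div.itertext() if t.strip())
    return "\n".join(chunks)


if __name__ == "__main__":
    print(extract_work_text(SAMPLE_HTML))
```

If the printed text for a real record is much shorter than the rendered page, the selector rather than the search side would be the first thing to look at.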


The above code seems to generate the index OK, and it also manages
storing and retrieving the metadata (titles, catalogue numbers,
etc.). However, I don't get anything like the number of hits I'd expect
for any given search. Generally, I get two or three hits for the
search terms I try, even though I know that tens of records in my
database match the terms I'm testing. (I used to use Swish-e for my
indexing and it returned the kinds of results I'm expecting to see.)
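
One way to narrow this kind of mismatch down might be to compare how many documents the index holds with what the query parser makes of the search terms. The helper below is a hedged sketch, not a diagnosis: summarise_search() is a hypothetical name, and it only relies on get_doccount() on the database, parse_query() on the query parser, and str() on the resulting query, which the Python bindings document as the way to obtain a description.

```python
def summarise_search(database, query_parser, terms):
    """Collect basic facts about the index and the parsed query.

    `database` is expected to offer get_doccount() and `query_parser`
    to offer parse_query(), as xapian.Database and xapian.QueryParser do.
    """
    query = query_parser.parse_query(terms)
    return {
        # How many documents actually made it into the index.
        "documents_in_index": database.get_doccount(),
        # The parsed query, showing which (stemmed) terms will be matched.
        "parsed_query": str(query),
    }

# Possible usage against the index above (requires the xapian module):
#   database = xapian.Database('indexes')
#   qp = xapian.QueryParser()
#   qp.set_stemmer(xapian.Stem('english'))
#   qp.set_database(database)
#   qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
#   print(summarise_search(database, qp, 'some search terms'))
```

A document count far below the number of records, or a parsed query whose terms look different from what was typed, would each point to a different half of the code.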

So I don't really know what to ask. Does anyone know what I'm doing
wrong? Or is Xapian behaving as it should and I'm just expecting the
wrong thing of it?

Cheers,
Richard


