[Xapian-discuss] Newbie problems with searching from Python
Richard Lewis
richardlewis at fastmail.co.uk
Mon Apr 27 18:26:24 BST 2009
Hi there,

I'm brand new to Xapian and trying to use it to add a full-text search
facility to my Web-published database using Python. I've developed an
API which gives me XML views of the database records I need to
index. I also have XSLT stylesheets which transform those XML views
into HTML for Web presentation.

What I'm trying to do is build a Xapian index of my HTML documents and
provide a simple keyword search interface to that index. Both the
indexing operation and the searching operation need to be called in
response to HTTP requests on an always-alive server (CherryPy, in
fact), so a solution based on the Python bindings is preferable to
running an external application (such as Omega) in a separate process.

So far, I have the following:

import xapian
from lxml import etree  # assumption: etree.tostring() and .xpath() below come from lxml

# some document 'value' (or metadata) constants
DOC_PATH = 0
DOC_RECORD_TYPE = 1
DOC_CATNO = 2
DOC_TITLE = 3
DOC_SUBTITLE = 4
DOC_YEAR = 5

def build_fulltext_index():
    # initialise the Xapian indexer
    database = xapian.WritableDatabase('indexes', xapian.DB_CREATE_OR_OPEN)
    indexer = xapian.TermGenerator()
    stemmer = xapian.Stem('english')
    indexer.set_stemmer(stemmer)

    for work in works_table.list_records():
        work_html = html_XSLT(work.xml())

        # create a Xapian document
        doc = xapian.Document()

        # set its properties to the properties of the work
        doc.add_value(DOC_PATH, '/works/%s' % work['catalogue_no'])
        doc.add_value(DOC_RECORD_TYPE,
                      work_html.xpath('//meta[@name="record-type"]')[0].attrib['content'])
        doc.add_value(DOC_CATNO, work['catalogue_no'])
        doc.add_value(DOC_TITLE, work['title'])
        doc.add_value(DOC_SUBTITLE, work['subtitle'])
        doc.add_value(DOC_YEAR, work['year'])

        # set the HTML version of the work as the Xapian
        # document's data
        doc.set_data(etree.tostring(work_html))

        # index all the text inside the HTML <div class="work">
        # element
        indexer.set_document(doc)
        indexer.index_text('\n'.join(
            work_html.getroot().xpath('//div[@class="work"]//text()')))

        # add the document to the database
        if doc.get_docid() == 0:
            print('/works/%s has docid of 0' % work['catalogue_no'])
        else:
            database.replace_document(doc.get_docid(), doc)

def search(terms, start, count):
    # start is the 1-based rank of the first hit to return and count
    # is the maximum number of hits; both are supplied by the caller

    # load the index and initialise the query
    database = xapian.Database('indexes')
    enquire = xapian.Enquire(database)
    qp = xapian.QueryParser()
    stemmer = xapian.Stem('english')
    qp.set_stemmer(stemmer)
    qp.set_database(database)
    qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    query = qp.parse_query(terms)

    # execute the query
    enquire.set_query(query)
    matches = enquire.get_mset(start - 1, count)

    # iterate over the results
    for m in matches:
        # retrieve the document
        document = m.document

        # get the catalogue number and record-type of the hit
        cat_no = document.get_value(DOC_CATNO)
        record_type = document.get_value(DOC_RECORD_TYPE)

The above code seems to generate the index OK, and it also manages to
store and retrieve the metadata (titles, catalogue numbers, etc.).
However, I get nowhere near the number of hits I'd expect for any
given search: generally two or three hits for the search terms I try,
when I know that tens of records in my database match those terms.
(I used to use Swish-e for my indexing and it returned the kind of
results I'm expecting to see.)
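
One way to narrow down a shortfall like this is to check how many documents actually reached the index and how the query parser interpreted the search terms. The following is only a rough sketch: the helper names index_report and describe_query are made up for illustration, and the record count (250) and search term ('symphony') in main() are placeholders rather than real figures.

```python
def index_report(database, expected_records):
    """Compare the index's document count with the number of source records.

    `database` is any object with a get_doccount() method, such as a
    xapian.Database; `expected_records` is supplied by the caller.
    """
    indexed = database.get_doccount()
    return {
        'indexed': indexed,
        'expected': expected_records,
        'missing': max(expected_records - indexed, 0),
    }


def describe_query(query_parser, terms):
    """Return the parsed query as text, showing how the terms were
    stemmed and combined."""
    return str(query_parser.parse_query(terms))


def main():
    # Imported here so the helpers above can be tried without Xapian installed.
    import xapian

    database = xapian.Database('indexes')
    qp = xapian.QueryParser()
    qp.set_stemmer(xapian.Stem('english'))
    qp.set_database(database)
    qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)

    # 250 and 'symphony' are placeholders, not real values.
    print(index_report(database, 250))
    print(describe_query(qp, 'symphony'))
```

If the 'indexed' figure is far below the number of records, the problem is more likely in the indexing step than in the search step.
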
So I don't really know what to ask. Does anyone know what I'm doing
wrong? Or is Xapian behaving as it should and I'm just expecting the
wrong thing of it?
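
One thing that may be worth ruling out, offered as an assumption rather than a diagnosis: as far as the Xapian API documentation describes it, a newly constructed xapian.Document reports a docid of 0 until it has been read back from a database, so the docid test in the indexing loop may not behave as intended. A possible alternative is to identify each record by a unique term and let replace_document() add or update it. In the sketch below the function names are hypothetical, and the 'Q' prefix follows the convention Omega uses for unique ID terms.

```python
def unique_term_for(catalogue_no):
    """Build a unique identifier term for a record ('Q' prefix by convention)."""
    return 'Q' + str(catalogue_no)


def store_work(database, doc, catalogue_no):
    """Add the document, or replace the existing one with the same unique term.

    `database` is expected to behave like xapian.WritableDatabase and
    `doc` like xapian.Document. Returns whatever replace_document()
    returns, which for a real database is the docid that was used.
    """
    term = unique_term_for(catalogue_no)
    # The term must be on the document so later runs can find and replace it.
    doc.add_term(term)
    return database.replace_document(term, doc)
```

In the indexing loop this would take the place of the if/else on doc.get_docid(), for example store_work(database, doc, work['catalogue_no']).
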
Cheers,
Richard