[Xapian-discuss] performance on document.get_data()

Wed Oct 30 00:02:08 GMT 2013

On Wed, Oct 23, 2013 at 01:30:51PM +0800, Tong Liu wrote:
> I have a performance issue with document.get_data() and
> enquire.get_mset(). It takes 35 seconds for matches =
> enquire.get_mset(0,200), and 3 seconds to iterate over all docs in
> matches and call get_data. Is that normal? My index contains 30 million
> documents. I use the Python bindings to operate Xapian. Below is my
> index structure:
> # value: 0:date, 1:site
> # data: json message which contains: author, url, message(30 words)

That sounds much slower than I'd expect.  Is that the cold cache time?
If so, does rerunning the same query take much less time?
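
One way to check this is to time the same call twice in one process and compare the two figures. The sketch below is only an illustration: the helper name time_repeated is hypothetical, it assumes Python 3 (time.perf_counter), and the commented usage assumes the enquire object from the quoted snippet.

```python
import time

def time_repeated(run_query, repeats=2):
    """Call run_query() `repeats` times; return each call's elapsed seconds."""
    timings = []
    for _ in range(repeats):
        started = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - started)
    return timings

# Hypothetical usage with the enquire object from the quoted snippet:
#   first, second = time_repeated(lambda: enquire.get_mset(0, 200))
#   print("first run: %.2fs, second run: %.2fs" % (first, second))
```

If the second figure is far smaller than the first, the first run was most likely dominated by reading the index from disk into the cache.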

> Do you have any ideas for improving the search performance, especially
> doc.get_data?
> 
> my code snippet
> 
> database = xapian.Database("%s/athena" % DATA_PATH)
> enquire = xapian.Enquire(database)
> enquire.set_weighting_scheme(xapian.BM25Weight())
> query = parse(keywords)

What are you passing in for keywords here?

> enquire.set_query(query)
> matches = enquire.get_mset(start, 200)

Is start 0 here?

> matches.fetch()

With a local database, it probably won't help to call fetch().

> result = [json.loads(match.document.get_data()) for match in matches]

So your time includes parsing the JSON. To focus on the time actually
taken by Xapian and its Python bindings, try changing that to:

  result = [match.document.get_data() for match in matches]
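
To see how the 3 seconds split between fetching and parsing, the two steps can also be timed separately. This is a rough sketch rather than a definitive benchmark: the helper name time_fetch_and_parse is hypothetical, it assumes Python 3, and the commented usage assumes the matches object from the quoted snippet.

```python
import json
import time

def time_fetch_and_parse(fetch_all, parse=json.loads):
    """Time fetching the raw items and parsing them as two separate steps.

    Returns (fetch_seconds, parse_seconds, parsed_items).
    """
    started = time.perf_counter()
    raw = fetch_all()                      # e.g. the get_data() calls
    fetched = time.perf_counter()
    parsed = [parse(item) for item in raw]  # e.g. json.loads on each item
    done = time.perf_counter()
    return fetched - started, done - fetched, parsed

# Hypothetical usage with the matches object from the quoted snippet:
#   fetch_s, parse_s, result = time_fetch_and_parse(
#       lambda: [match.document.get_data() for match in matches])
#   print("get_data: %.2fs, json.loads: %.2fs" % (fetch_s, parse_s))
```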

Also, what Xapian version are you using?

Cheers,
    Olly