[Xapian-discuss] Always returning ALL the documents matching a query

Olly Betts olly at survex.com
Sat Jan 10 12:48:00 GMT 2009

On Mon, Dec 29, 2008 at 02:26:37PM -0500, tata 668 wrote:
> If I want to manage the sorting outside Xapian, I have to get all the
> posts resulting of the search in the first place.

The essential problem with getting all the results is that doing so is
inherently at best O(total_number_of_results) in both time and memory.
And that can easily end up being O(total_number_of_documents) - e.g.
searching for "the" if the documents are in English.  In fact the time
complexity will likely be O(N*log(N)) because we always perform a sort
currently (this could potentially be avoided in some cases when the
results are being ordered only by document id since the posting lists
are sorted by document id).

If you work with the features Xapian provides and only request the
number of documents you actually want to display then the matcher only
needs to keep the best n documents seen so far (where n = mset_size +
offset) which is a lot less memory for most queries.  Also various
optimisation tricks which the matcher knows allow us to terminate early
in many cases, so the time required is often much less (exactly how well
these work depends on the query, the various options used, and the
nature of the data).

So using Xapian as a simple "source of document ids" is really making it
operate with its hands tied behind its back.  It'll do what you ask
(unless you run out of memory), but you would do better to use the
sorting and filtering facilities provided rather than ignoring them and
implementing your own externally.

If you don't have many documents, you can probably get away with not
worrying about efficiency, especially if you don't have a high query
rate either.  But if you're handling large numbers of documents, you may
run into problems, and small systems have a habit of growing...


More information about the Xapian-discuss mailing list