[Xapian-discuss] what is the fastest way to fetch results which are sorted by timestamp ?

Thu Aug 11 12:17:35 BST 2011

On 11 August 2011 11:18, Henry C. <henka at cityweb.co.za> wrote:
> It's a real pity xapian-compact doesn't have a --sort-by-value argument to
> perform post-indexing basic sorting of some kind.
>
> Is something like this even possible (by that I mean a change to
> xapian-compact code)?

It's not really possibly to do sorting (or other reordering of docids)
during the process that xapian-compact performs; it's working at a
lower level than that, stitching chunks of postlists together without
actually interpreting their contents.

I think it would theoretically be possible to write something which
produced a mapping from old docid to new docid, and then ran through
all the posting lists in turn, reordering them according to this
mapping (and also reordered the termlist, positionlist and record
databases).  The obvious implementation of this would need to hold the
whole of each postlist in memory, but that would probably be okay in
most situations; even if a postlist had 1000 million entries, that
would fit into the memory of modern servers.  This wouldn't be a
simple modification of xapian-compact, though; it would be a whole new
tool.

Other than that being implemented, to sort the database you really
need to work at pretty much the level of the xapian database API; ie,
implement something more like the copydatabase tool, which copies
documents in the new order.  I've written code (in python) to sort
databases using this method in the past - which worked ok for a few
million documents, but isn't particularly efficient.  I don't have the
rights to distribute that code, but it was pretty simple.  If I
remember correctly, it pulled the values to sort by into a numpy
array, and used one of numpy's functions to produce a mapping from the
old docid to the new docid, and then just ran through the old database
reading documents and writing them to the new database in the correct
position.

> In our case we're indexing hundreds of millions of docs to several hundred
> thousand sub-indexes, then merging those to hundreds of indexes which are
> searched against.  Keeping all that sorted (by custom "PageRank" value) during
> index-time to maximise sorted-result performance is nigh on impossible.

Indeed.  What might be more possible would be to partition the
documents by pagerank value into several shards; so that low values go
into one bucket, middle into another, high into another still.  Then,
it should be possible to merge the databases keeping these shards
together and in decreasing weight order (by assigning document IDs in
particular ranges to each shard, and using the option of
xapian-compact that preserves docids), and to keep track of the
maximum value.  You can then use a custom PostingSource to take
advantage of the knowledge of the maximum docids in each range of
document IDS in the output database, allowing the search to terminate
early in many cases.

Sorry, that's not a very coherent explanation - it's another technique
that I've experimented in the past with, with some success, but it's
again quite tricky to implement.

-- 
Richard