[Xapian-discuss] get_data not fast enough for query matches

Sat Feb 4 06:00:43 GMT 2006

On Thu, Feb 02, 2006 at 04:00:29PM +0000, Salem Berhanu wrote:
> Basically, I want users to be able to search different parts of a document. 
> For instance I want them to be able to search a title that contains the 
> term 'data compression' and in the description 'Rate-Distortion theory'. 

Don't index different fields of a document into different databases if
you want to be able to search them together - there's no good reason to,
and it just means you have to mess around merging the results from
multiple searches.

Instead prefix terms generated from additional fields, as James
suggested.
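
To make the prefix idea concrete, here is a minimal sketch in plain
Python. It is a toy in-memory index, not Xapian's API, and the prefixes
("S" for title terms, "XD" for description terms) and the helper names
are assumptions chosen for illustration. The point is only that both
fields live in one index and the prefix keeps their terms apart.

```python
from collections import defaultdict

# Hypothetical prefixes: "S" for title terms, "XD" for description terms.
PREFIXES = {"title": "S", "description": "XD"}


def terms_for(field, text):
    """Lowercase, split on whitespace, and prefix each word for its field."""
    prefix = PREFIXES[field]
    return [prefix + word for word in text.lower().split()]


def build_index(docs):
    """Map each prefixed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for term in terms_for(field, text):
                index[term].add(doc_id)
    return index


def search(index, **field_queries):
    """AND together every query word, each restricted to its own field."""
    result = None
    for field, text in field_queries.items():
        for term in terms_for(field, text):
            postings = index.get(term, set())
            result = postings if result is None else result & postings
    return sorted(result or [])


docs = {
    1: {"title": "Data compression", "description": "Rate-distortion theory basics"},
    2: {"title": "Data structures", "description": "Rate-distortion theory notes"},
    3: {"title": "Theory of compression", "description": "Lossless data coding"},
}
index = build_index(docs)
print(search(index, title="data compression", description="rate-distortion theory"))
```

A single search then covers both fields, so there is nothing to merge
afterwards.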

> This is the main reason I'm using several dbs. In addition I read that it's 
> better to have smaller dbs for better performance. (Maybe it's wrong)

If you have lots of documents, it's faster to index into several smaller
databases and merge them afterwards than to index into a single database
(what "lots" is depends on the hardware and the nature of the data, but
to give an idea, I index gmane's 30+ million messages in 1 million
document chunks).

However, it's probably a bit slower to search several smaller databases
(which is why merging is recommended).  If you commonly want to search
a subset of the data, then keeping that data in a separate database
is likely to be a win over a boolean filter on a merged database.

But if you want to split over several databases, split between documents
(e.g. put a million documents into each database) rather than trying to
put different fields into different databases.
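
As a rough illustration of splitting between documents, the sketch below
groups whole documents into fixed-size chunks, one chunk per database.
It is plain Python rather than Xapian code, and the "chunk-NNN" naming
scheme and function names are hypothetical.

```python
from itertools import islice


def chunked(documents, chunk_size):
    """Yield successive lists of at most chunk_size whole documents."""
    iterator = iter(documents)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def plan_databases(documents, chunk_size, name_pattern="chunk-{:03d}"):
    """Assign each chunk of documents to its own (hypothetical) database name."""
    return {name_pattern.format(n): chunk
            for n, chunk in enumerate(chunked(documents, chunk_size))}


# Ten document ids split into databases of at most four documents each.
plan = plan_databases(range(10), chunk_size=4)
for name, doc_ids in plan.items():
    print(name, doc_ids)
```

Every field of a given document ends up in the same database, so each
database can be searched on its own or merged with the others later.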

> I don't actually run out of space when I grab the data, it just takes a 
> long time. For instance I wrote a small query script to search for a term, 
> let me know how many matches it finds and then loops through the match 
> getting the data. I search for the word theory in description, within the 
> first 7 seconds it tells me it found 137480 which is good but then it takes 
> 2m15s to grab the data for each match.

We don't expect people to want all the results of a search that matches
so many documents, so I'm not surprised that this isn't lightning fast.

You're forcing the matcher to avoid most of its possible optimisations
(which is probably why the search takes 7 seconds), and then you're
retrieving lots of entries from the record table, which has been
designed with the expectation that you'll want more like 10-1000
results.
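
A toy sketch of the difference, in plain Python: the RecordStore class
and fetch_page helper are hypothetical stand-ins for the record table
and for asking only for a bounded window of results (in Xapian itself
the window is what you pass to Enquire::get_mset as first and maxitems).
It only counts record lookups; it says nothing about real timings.

```python
class RecordStore:
    """Hypothetical stand-in for the record table; counts lookups."""

    def __init__(self):
        self.lookups = 0

    def get_data(self, doc_id):
        self.lookups += 1
        return f"record {doc_id}"


def fetch_page(ranked_doc_ids, store, first=0, page_size=10):
    """Load records only for the requested window of the ranked matches."""
    window = ranked_doc_ids[first:first + page_size]
    return [store.get_data(doc_id) for doc_id in window]


matches = list(range(137480))  # pretend these are ranked match ids

store = RecordStore()
page = fetch_page(matches, store, first=0, page_size=10)
print(len(matches), "matches,", store.lookups, "record lookups for one page")

everything = RecordStore()
all_records = fetch_page(matches, everything, first=0, page_size=len(matches))
print(len(matches), "matches,", everything.lookups, "record lookups for all")
```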

I'm guessing you're only trying to get all the results so you can merge
the results from searching two fields in different databases, in which
case this ceases to be an issue if you use term prefixes instead.  If
I'm wrong, please explain *WHY* you want all 137480 matches.

Cheers,
    Olly