[Xapian-discuss] Emulating SQL JOIN in xapian

Simon Roe simon.roe at talusdesign.co.uk
Wed Feb 25 21:20:07 GMT 2009


I'm trying to do something that probably shouldn't be done in xapian,
but for various reasons I'd like to try.

I have a basic relational database consisting of payments to
recipients and recipients, with global recipient IDs (what would be a
'unique key' in a relational database):

Recipients "table": rid, name, address, [etc]
Payments "table": rid, amount,year, [etc]

I have indexed all this data in one xapain database, with a term
prefix ("XTYPE:[payment|recipient]") and various other values/terms
according to the type of content.

What I would like to do is query "name:john year:2009 amount:10..100",
something that would be expressed in SQL more or less like:

SELECT recipients.name, payments.year, payments.amount
FROM recipients
LEFT JOIN payments
ON recipients.rid=payments.rid

Now, I understand that this isn't what xapian does, but it has many
other advantages to me that I would rather get xapian to support this
that some other database to support the searching xapian does.  I
would also like to avoid having 2 databases and all the hassle that

Some ideas I've had:

1) Simply duplicate the recipient information on every payment record.
 This is possible as there aren't likely to be more than 20 payments
per recipient and I have the disk space.  I can see this speeding
everything up too.

2) (A bad idea, I think) split the prefixes up in to queries for each
type of document (payments and recipients), search for all the
matching recipients, then search within these for the payment
information.  I would ideally like to do this by doing something like:

>  payment_enquire = xapian.Enquire(recipients_mset)

In other words, using the mset from the first query as the search
database for the second.  This doesn't seem to be an option though.  I
don't think the query builder does this, as from what I can tell it
just builds a single query that it processed.

Another way of doing this would be to search for the recipients, then
just OR all the IDs together and add it to the payment search (so,
"amount:10..1000 AND (id:1 OR id:2)").

This would be memory intensive (and slow), as to allow sorting of the
payment values I'd have to return every matching document from the
recipients search to the payment search.

3) Don't use xapian.  This is a last resort!  One of the main reasons
I'm using xapian is that my data isn't in a single scheme.  There are
maybe 10 core fields that is in each row, but maybe 30 more that is
only in some of the data.  	Using xapian to 'tag' this data on to a
document with term prefixes is a perfect way to deal with this.  Using
something like MySQL would mean I'd have to have loads of mostly
unused columns on most rows and I'd have to alter the scheme every
time a new field came along.  Unless there is some database that
allows me to add ad hoc values to a document/row it's no good.

The data will be read only and most of it needs to have a good,
stemmed text search on it too. So apart from this pseudo-join problem,
xapian is perfect.

Apologies for the long email: I'm finding it hard to be concise when
explaining this!

Help save the economy:

E: simon.roe at talusdesign.co.uk
M: 07742079314

More information about the Xapian-discuss mailing list