[Xapian-discuss] Practical example/explanation using an existing database

Tue Jul 24 15:19:54 BST 2007

(I'm probably repeating some information already given by others -
hopefully this helps by being all in one place!)

On Tue, Jul 24, 2007 at 04:04:53AM +0200, Edwin Smulders wrote:

> Firstly, how exactly does the indexing work in regard to telling
> Xapian what to search through? Do we write an SQL query returning all
> the data we want indexed? or maybe do we tell it what tables/columns
> to index (ie. does it generate queries?)
> And how is the index updated, a regular rescan or an update whenever
> data in our system updates?

Hi, Edwin. You should use Omega on top of Xapian, which gives you most
of the search engine you'll need. Omega comes with a script dbi2omega
which dumps an SQL database into something suitable for running
scriptindex over.

You'll need to figure out how to update the index yourself. If the
database is small, just rebuild the entire index each time; if not,
and you can detect changed entries in the SQL database, it wouldn't be
difficult to modify dbi2omega to add a WHERE clause that selects only
those entries.
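
As a rough sketch of the WHERE-clause idea (this is not dbi2omega
itself, just an illustration in Python): assume a table `t` with id,
name, description and content columns, plus a hypothetical `modified`
timestamp column that your application keeps up to date. SQLite is
used here only so that the example is self-contained.

```python
import sqlite3

def changed_rows(conn, since):
    """Return rows of t modified after `since` (a 'YYYY-MM-DD HH:MM:SS' string).

    The `modified` column is an assumption for this sketch; use whatever
    change marker your own schema provides.
    """
    cur = conn.execute(
        "SELECT id, name, description, content FROM t "
        "WHERE modified > ? ORDER BY id",
        (since,),
    )
    return cur.fetchall()

# Build a tiny throwaway database to demonstrate the query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT, description TEXT, "
    "content TEXT, modified TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [
        (1, "old", "unchanged row", "...", "2007-07-01 00:00:00"),
        (2, "new", "recently edited row", "...", "2007-07-24 12:00:00"),
    ],
)

# Only rows touched since the last indexing run need to be dumped again.
rows = changed_rows(conn, "2007-07-20 00:00:00")
print(rows)
```

Note that a query like this only finds new and changed rows; rows
deleted from the SQL database need handling separately.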

> The other question that came to mind is, once everything is indexed,
> how is the data returned on a search?

A lot of this depends on how your index plan works. Basically,
scriptindex takes an input file (the data, produced in this case by
dbi2omega) and an index file, which describes how the Xapian database
is built. Say you have a table:

----------------------------------------------------------------------
CREATE TABLE `t` (
       id INT NOT NULL,
       name VARCHAR(50) NOT NULL DEFAULT "",
       description VARCHAR(255) NOT NULL DEFAULT "",
       content TEXT NOT NULL,
       PRIMARY KEY(id)
);
----------------------------------------------------------------------
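
To give a feeling for what scriptindex reads, here is a minimal Python
sketch that turns one row of such a table into a scriptindex input
record: one name=value line per field, an '=' at the start of each
continuation line when a value contains newlines, and a blank line to
end the record. The function name is made up for this example, and
this is not how dbi2omega is implemented.

```python
def to_scriptindex_record(row):
    """Format a dict of field -> value as one scriptindex input record."""
    lines = []
    for field, value in row.items():
        first, *rest = str(value).split("\n")
        lines.append(f"{field}={first}")
        # Continuation lines of a multi-line value start with '='.
        lines.extend("=" + extra for extra in rest)
    # A blank line separates this record from the next one.
    return "\n".join(lines) + "\n\n"

record = to_scriptindex_record({
    "id": 1,
    "name": "Example",
    "description": "A short description",
    "content": "First line\nSecond line",
})
print(record, end="")
```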

You might decide that you want people to be able to search on contents
of names, contents of descriptions, or across any text (name,
description or content).

The way you search within particular fields in Xapian is to use term
prefixes. Look at docs/termprefixes.txt in the Omega distribution for
background on this.

So you'd want to have:

 * name indexed with no prefix
 * name indexed with a prefix of 'S' (subject)
 * description indexed with no prefix
 * description indexed with a prefix of 'S' [1]
 * content indexed with no prefix
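
To make the prefix idea concrete, here is a deliberately simplified
Python sketch of the terms a piece of text would contribute when it is
indexed both with no prefix and with a prefix of 'S'. It only splits
on word characters and lowercases; Xapian's real term generation does
more (stemming and word positions, for example), so treat this as an
illustration rather than what Xapian actually runs.

```python
import re

def terms(text, prefix=""):
    """Lowercase each word and prepend the (possibly empty) term prefix."""
    return [prefix + word.lower() for word in re.findall(r"\w+", text)]

name = "Xapian Search"

# Unprefixed terms match a search across any text; S-prefixed terms
# only match a search restricted to the subject-like fields.
all_terms = set(terms(name)) | set(terms(name, "S"))
print(sorted(all_terms))
```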

Your index file might look like:

----------------------------------------------------------------------
id: field=id boolean=Q unique=Q
name: field=name index index=S
description: index index=S truncate=200 field=sample
content: index
----------------------------------------------------------------------

The field= bits put fields into the document data, so that you can
extract them later. (See the omegascript documentation.)

You read the lines from left to right, so the 'description' one (for
instance) says:

 * first, index the text
 * then index the text again, with a prefix of 'S'
 * then truncate the text to 200 characters at most, but avoiding
   truncating a word
 * then put the truncated text into the ``sample'' field
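
The left-to-right order matters: the indexing steps see the full
text, and only the stored sample is shortened. A small Python sketch
of that sequence for the 'description' line follows. The helper names
and the exact truncation rule are assumptions made for illustration;
scriptindex's own implementation may differ in detail.

```python
import re

def truncate(text, limit):
    """Cut text to at most `limit` characters without splitting a word."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    if not text[limit].isspace():
        # The cut would land inside a word: back up to the previous space.
        head, sep, _ = cut.rpartition(" ")
        if sep:
            cut = head
    return cut.rstrip()

def run_description_actions(text):
    """Apply: index, index=S, truncate=200, field=sample (in that order)."""
    doc = {"terms": set(), "fields": {}}
    words = [w.lower() for w in re.findall(r"\w+", text)]
    doc["terms"].update(words)                   # index
    doc["terms"].update("S" + w for w in words)  # index=S
    text = truncate(text, 200)                   # truncate=200
    doc["fields"]["sample"] = text               # field=sample
    return doc

doc = run_description_actions("Fast search " * 30)
print(len(doc["fields"]["sample"]), sorted(doc["terms"]))
```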

Then you need to set up a suitable template in Omega. I'd recommend
using the default to start off with until you have a feeling for
what's going on, and then start using the xml template to get data
into the rest of your system, which can hook up against the database
as needed. (Or you may not need this. Depends what you're trying to
do.)

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org