[Xapian-devel] Omega changes

James Aylett james-xapian at tartarus.org
Fri Dec 17 15:05:13 GMT 2004


On Fri, Dec 17, 2004 at 02:15:34PM +0000, Richard Boulton wrote:

> 1) Configuration handling for omega.

+1

> 2) I propose to implement this as a new duplicates option (call it
> "timestamp"), and make it the default duplicates option.

+1

> Actually, Olly suggested that it might be sensible to remove the
> duplicates options entirely, and simply default to the behaviour
> specified above.  Does anyone actually use omindex with a --duplicates
> option other than "replace"?

I doubt it very much. They're only there for some measure of backwards
compatibility in case anyone actually liked the old way of working.

--duplicates=ignore was designed to save time when you only add
documents to the corpus. Shouldn't be needed with
--duplicates=timestamp, and I can't think of a good reason to use
replace instead of timestamp.
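
As a rough illustration of why ignore becomes redundant, here is a
minimal Python sketch. It assumes (this isn't spelled out in this
message) that --duplicates=timestamp means "reindex a file only if it
has changed since it was last indexed". needs_reindex and both of its
parameters are hypothetical names, not anything that exists in omindex.

```python
def needs_reindex(file_mtime, indexed_mtime):
    """Decide whether a file should be (re)indexed.

    file_mtime:    modification time of the file on disk (seconds).
    indexed_mtime: modification time recorded when the file was last
                   indexed, or None if it is not in the database yet.

    Assumption: "timestamp" means new files are added, changed files
    are replaced, and unchanged files are skipped.
    """
    if indexed_mtime is None:
        return True                      # new document: always index
    return file_mtime > indexed_mtime    # changed since last run?
```

Under that assumption an unchanged file costs one comparison, which is
the saving ignore was there to provide, while a changed file still gets
replaced.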

--duplicates=duplicate is daft, but it was easy to add :-)

I'd be happy to lose this option. It'd make the quickstart
instructions a lot more obvious, too :)

> 3) Add database specific configuration files to omindex, which are used
> to specify how a database has been indexed.  These configuration files
> could consist simply of the command line options used, or possibly
> equivalent information in an easy-to-parse format.  The configuration
> file could be used by omega to configure the query parser, and other
> search options, appropriately to the database being searched.

+1. At least.

However: how would you cope with this with two databases with
different indexing options? Specifically, is there anything sane we
can do with different stemmers in use?
 
> In addition to current options, these configuration files could specify
> which information to store in the 

... ? :)
 
> 4) Finally, I propose changing the way in which omega and omindex map
> file locations to urls.  Currently, the URL at which a document is
> displayed is stored in each document in the Xapian database.  This has
> the obvious drawback that the index needs to be regenerated if a server
> is reconfigured (for example, change of hostname, or change of path
> within the server).
> 
> Instead, omindex would store the local path of the document in the
> database, and would store no information about the URLs at which
> documents are available externally.  Omega would be provided with a
> translation table in each database from local file prefix to external
> file prefix, and would use this to generate the external URLs.  I've
> used this scheme with other systems, so I know it can be made to work,
> but it would require some changes to applications currently using
> omindex.

Hmm. What I think you're saying is that we do the following:

index option: file-path url-path filename{file-suffix}
indexes file: file-path/filename{file-suffix}
mapping:      file-suffix -> url-suffix [may have several of these]
config:       url-prefix

final url:    url-prefix/url-path/filename{url-suffix}
stored in db: url-path/filename{url-suffix}
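
To make that concrete, a minimal Python sketch of the scheme as I've
written it above. build_url, its parameters and suffix_map are all
hypothetical names for illustration; none of this is existing omega or
omindex code, and it assumes a suffix with no mapping is kept as-is.

```python
def build_url(url_prefix, url_path, filename, file_suffix="", suffix_map=None):
    """Return (stored, final) for one indexed file.

    stored: url-path/filename{url-suffix}, the value kept in the db.
    final:  url-prefix/url-path/filename{url-suffix}.
    """
    suffix_map = suffix_map or {}
    # Assumption: no mapping for a suffix means it is used unchanged.
    url_suffix = suffix_map.get(file_suffix, file_suffix)
    parts = [url_path.strip("/"), filename + url_suffix]
    stored = "/".join(p for p in parts if p)   # drop an empty url-path
    final = url_prefix.rstrip("/") + "/" + stored
    return stored, final


stored, final = build_url("http://example.com", "company", "about", ".html")
print(stored)
print(final)
```

Because url-prefix is only applied at the end, changing the hostname
means changing one config value rather than regenerating the index.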

So if you have (in Apache terms) a DocumentRoot of /sites/example.com
for http://example.com/, we might have (assuming that specifying no
mappings just maps each file-suffix to the same url-suffix):

global config:
--------------
url-prefix: http://example.com

index config:
-------------
file-path: /sites/example.com
url-path:

which will index the whole thing, no problems.

index config:
-------------
file-path: /sites/example.com/company
url-path: company

index config:
-------------
file-path: /sites/press-area/
url-path: press

to index two subparts. You can then do the root with --no-recurse.

That's all fine. With some finesse, we can avoid having to specify
lots of mappings when you don't have suffixes in the URLs (which you
shouldn't).

What we're talking about is shifting the [BASEDIRECTORY] DIRECTORY
split into a [URLPATH] DIRECTORY split. I can't think of any
problems with that, and indeed it probably makes a lot more sense to
people that aren't me (more accurately, me three years ago :-) than
the current way of doing it. Better, URLPATH should be mandatory, and
you can just put / in if you're doing the whole site.

> Finally, is there a problem with making any of these changes whilst
> we're within the 0.8.x version cycle, or is the expectation that the
> workings of omega and related tools will be reasonably stable within
> this cycle, as the API of libxapian is?

I'd be inclined to hold off the db-specific config until 0.9.x,
personally. The other changes - configuration location, which has
always been broken, and duplicates, which will make life better
without (hopefully) any drawbacks - I'd say go ahead now.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org



