[Xapian-discuss] omega: omindex behaviour with duplicate files

Thu Jul 12 11:28:33 BST 2007

On Thu, Jul 12, 2007 at 06:48:39PM +1000, John Pye wrote:

> I need a little clarification with regard to Omega's behaviour with
> 'duplicate' files when running 'omindex'.
> 
> How is a duplicate recognised? Is it simply by file path? How is an
> unmodified file detected, if at all?

Duplicates are recognised by the constructed URL path. The calculated
MD5 hash could be used for modification detection, but omindex doesn't
do that right now.
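
A minimal sketch of what an MD5-based modification check might look
like if done outside omindex. It assumes GNU md5sum is available;
needs_reindex and the manifest file are hypothetical names, not part
of omega.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if FILE should be reindexed.
# MANIFEST holds lines in md5sum's own output format ("<hash>  <path>").
# A file needs reindexing when its current "<hash>  <path>" line is not
# present in the manifest, or when the manifest does not exist yet.
needs_reindex() {
    file=$1
    manifest=$2
    # If the file cannot be read, report it as needing attention.
    new=$(md5sum "$file") || return 0
    # -x: match whole lines only, -F: fixed string, -q: no output.
    ! grep -qxF -- "$new" "$manifest" 2>/dev/null
}
```

After a successful indexing run the manifest would be regenerated with
md5sum so that the next run only picks up files whose content changed.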

> I would like to set up subversion post-commit hook to update my index.
> If possible I would like to just update the index with the newly
> committed files. What is the most efficient way to do this? Is it
> something that has already been implemented by others?

Right now this can't be done using omindex. I *think* I posted a
potential patch a while back (or possibly just how to write the code)
so that you could provide a filename instead of a directory to
omindex. If you combine that with the -p switch, you can reindex a
single file at a time.
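
On the hook side, a rough sketch of picking out the files touched by a
commit. It assumes the usual post-commit arguments (repository path,
then revision) and the column layout of 'svnlook changed' output;
classify_changes is a hypothetical helper. It only decides what would
need reindexing or deleting, and does not assume that omindex accepts
a single filename.

```shell
#!/bin/sh
# Hypothetical helper: read `svnlook changed` lines on stdin and print
# "index <path>" for added/updated files, "delete <path>" for deleted files.
# Paths are repository-relative and still need mapping onto the checkout.
classify_changes() {
    while IFS= read -r line; do
        status=$(printf '%s' "$line" | cut -c1)   # first status column
        path=$(printf '%s' "$line" | cut -c5-)    # path starts at column 5
        case "$path" in
            */) continue ;;   # directories (deleted ones too) are skipped here
        esac
        case "$status" in
            A|U) printf 'index %s\n' "$path" ;;
            D)   printf 'delete %s\n' "$path" ;;
        esac                  # "_U" (property-only change) is ignored
    done
}

# Intended use inside a post-commit hook (REPOS="$1", REV="$2"):
#   svnlook changed -r "$REV" "$REPOS" | classify_changes
```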

> Secondly, is there any way that the verbosity of the omindex output can
> be reduced? I would like it if there were a '--quiet' option that only
> output information about files that were actually being reindexed.

That's a good idea, but there's no way of doing it without changing
the code right now. If you can identify which messages you think
should be eliminated in --quiet mode, I can make the changes for you.

> I would like to set up this post-commit hook so that documents deleted
> from the repository are correctly removed from the index. At present my
> post-commit hook script works by brute force, and looks like this:
> 
> #!/bin/sh
> cd /data/omegadocs && svn up
> omindex -d ignore --db /var/lib/omega/data/default --url /svn/ \
>     /data/omegadocs
> 
> If there are any tips for improving this, it would be much appreciated.

I'd recommend using scriptindex for this, which can delete a single
document (or several documents) more efficiently. However, you do have
to be able to generate the unique U-term that omindex uses, which is
based on the constructed URL. It only gets fiddly if the URL is long;
delve(1) will help you work out the terms in the shorter cases, if you
can't read the omindex C++ source to find out the details.
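
A sketch for the short-URL case only. It assumes the term is simply
'U' followed by the constructed URL (the --url base plus the path
relative to the indexed directory) and that the path contains no
characters that would need escaping. make_uterm is a hypothetical
helper, and its output should be checked against the real index
before being relied on.

```shell
#!/bin/sh
# Hypothetical helper: build the U-term for a short URL.
#   $1  base URL as passed to omindex --url (e.g. /svn/)
#   $2  file path relative to the indexed directory
# Long URLs are handled differently by omindex and are not covered here.
make_uterm() {
    base=$1
    rel=$2
    # Normalise to exactly one slash between the base URL and the path.
    printf 'U%s%s\n' "${base%/}/" "${rel#/}"
}

# To check a result against the index (assuming your delve supports -t):
#   delve -t "$(make_uterm /svn/ docs/a.txt)" /var/lib/omega/data/default
```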

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org