[Xapian-discuss] omega: omindex behaviour with duplicate files

John Pye john at curioussymbols.com
Tue Jul 24 02:38:07 BST 2007


Hi James

Sorry for the delay,

James Aylett wrote:
> On Thu, Jul 12, 2007 at 06:48:39PM +1000, John Pye wrote:
>
>   
>> I need a little clarification with regard to Omega's behaviour with
>> 'duplicate' files when running 'omindex'.
>>
>> How is a duplicate recognised? Is it simply by file path? How is an
>> unmodified file detected, if at all?
>>     
>
> It's done by constructed URL path. You could use the calculated MD5
> hash to do modification detection, but it doesn't right now.
>   

This would be a nice feature to have in omindex, as I'm sure it must be
a common problem for many entry-level users like me.
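
Until omindex does this itself, something similar could be approximated outside it. A minimal sketch, assuming GNU md5sum is installed and that a manifest file is kept somewhere outside the document tree (the function name and the manifest are made up for illustration; omindex neither reads nor writes them):

```shell
# Sketch only: list files that are new or whose MD5 changed since the
# previous run. The manifest is a hypothetical helper file and must live
# outside DOCROOT, otherwise it would be listed as a document itself.

# changed_files DOCROOT MANIFEST
changed_files() {
    docroot=$1
    manifest=$2
    new=$(mktemp) || return 1
    # Start from an empty manifest on the first run.
    [ -f "$manifest" ] || : > "$manifest"
    # One "hash  path" line per file, sorted so comm can compare the runs.
    (cd "$docroot" && find . -type f -exec md5sum {} + | LC_ALL=C sort) \
        > "$new" || { rm -f "$new"; return 1; }
    # Lines found only in the new listing are new or modified files;
    # strip the 32-character hash and the two spaces to leave the path.
    LC_ALL=C comm -13 "$manifest" "$new" | sed 's/^[0-9a-f]\{32\}  //'
    # Remember this listing for the next run.
    mv "$new" "$manifest"
}
```

Calling changed_files /data/omegadocs /var/tmp/omegadocs.md5 (the second path is arbitrary) would print one relative path per changed file. Files deleted since the last run are not reported by this sketch.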

>   
>> I would like to set up a Subversion post-commit hook to update my index.
>> If possible I would like to just update the index with the newly
>> committed files. What is the most efficient way to do this? Is it
>> something that has already been implemented by others?
>>     
>
> Right now this can't be done using omindex. I *think* I posted a
> potential patch a while back (or possibly just how to write the code)
> so that you could provide a filename instead of a directory to
> omindex. If you combine that with the -p switch, you can reindex a
> single file at a time.
>   

This would be a big help. With that patch and the -p switch, I could write
a much better Subversion post-commit hook script that uses the list of
committed files to update the index file by file.
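
A minimal sketch of how such a hook might sort the committed paths, assuming the usual "svnlook changed" output (a status column followed by the path). The function name and the "index"/"delete" labels are made up here; the step that actually reindexes a single file is left out because it depends on the patch mentioned above:

```shell
# Sketch only: turn `svnlook changed` output (read from stdin) into a
# work list. Prints "index PATH" for added or updated files and
# "delete PATH" for deleted files.
classify_changes() {
    while IFS= read -r line; do
        status=${line%"${line#??}"}   # first two characters: the status
        path=${line#????}             # text after the 4-character column
        # Directory entries end in "/"; this sketch skips them, so a
        # deleted directory would need separate handling.
        case $path in */) continue ;; esac
        case $status in
            D?)    printf 'delete %s\n' "$path" ;;
            A?|U?) printf 'index %s\n' "$path" ;;
            *)     ;;  # e.g. "_U" (property-only change): nothing to do
        esac
    done
}
```

In a post-commit hook, where Subversion passes the repository path and the revision as the first two arguments, this could be fed with: svnlook changed -r "$2" "$1" | classify_changes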

>   
>> Secondly, is there any way that the verbosity of the omindex output can
>> be reduced? I would like it if there were a '--quiet' option that only
>> output information about files that were actually being reindexed.
>>     
>
> That's a good idea, but there's no way of doing it without changing
> the code right now. If you can identify which messages you think
> should be eliminated in --quiet mode, I can make the changes for you.
>
>   
>> I would like to set up this post-commit hook so that documents deleted
>> from the repository are correctly removed from the index. At present my
>> post-commit hook script works by brute force, and looks like this:
>>
>> #!/bin/sh
>> cd /data/omegadocs && svn up
>> omindex -d ignore --db /var/lib/omega/data/default --url /svn/ \
>>     /data/omegadocs
>>
>> If there are any tips for improving this, it would be much appreciated.
>>     
>
> I'd recommend using scriptindex for this, which can delete a single
> document (or several documents) more efficiently. However you do have
> to be able to generate the unique U-term that omindex uses, which is
> based on the constructed URL. It only gets fiddly if the URL is long -
> delve(1) will help you construct them in the shorter cases, if you
> can't read the omindex C++ source to find out the details.
>   

I guess I will have to take a look at this. But since this is only a small
utility for our internal use, I'm unlikely to spend much time on it.
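
For the short-URL case, a rough sketch of building a candidate term. This assumes the term is simply "U" followed by the URL prefix and the relative path, with no characters that omindex would escape and no long-URL handling; none of that is confirmed against the omindex source, so the result should be checked against the index before it is used to delete anything. The function name is made up:

```shell
# Sketch only: guess the unique term for a short URL.
# Assumption: term = "U" + URL prefix + "/" + relative path, unescaped.
# uterm URL_PREFIX RELATIVE_PATH
uterm() {
    base=${1%/}   # drop one trailing slash from the prefix, if present
    rel=${2#/}    # drop one leading slash from the path, if present
    printf 'U%s/%s\n' "$base" "$rel"
}
```

For example, uterm /svn/ docs/readme.txt gives U/svn/docs/readme.txt. Passing that term to delve with -t against /var/lib/omega/data/default should list a document if the guess matches what omindex stored.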

Cheers
JP
