[Xapian-discuss] How to omindex some sub-directories?

Charles xapian at charlesmatkinson.org
Mon May 20 12:31:41 BST 2013


On 17/05/13 03:29, Olly Betts wrote:
> On Wed, May 15, 2013 at 12:05:01PM +0530, Charles wrote:
>> Given a directory tree like ...
>>
>> /foo
>> |
>> +-- A
>> |
>> +-- B
>> |
>> +-- C
>>
>> ... what is the best way to index A and C into a single Xapian database?
> 
> I guess you mean "A and B" (or "C" not "B" below)...
> 
>> AFAIK the alternatives are:
>>
>> omindex --db /my_db --no-delete /foo /foo/A
>> omindex --db /my_db --no-delete /foo /foo/B
> 
> I think it is better to use --url (though I find the subsite stuff
> confusing, so I may be misunderstanding the plan behind it):
> 
> omindex --db /my_db --no-delete --url /foo/A /foo/A
> omindex --db /my_db --no-delete --url /foo/B /foo/B
> 
>> or
>>
>> omindex --db /my_A_db /foo /foo/A
>> omindex --db /my_B_db /foo /foo/B
>> xapian-compact /my_A_db /my_B_db /my_db
> 
> Another approach is to "fake up" the tree you want to index with
> symlinks and then index that - e.g.:
> 
> mkdir foo-to-index
> ln -s /foo/A /foo/B foo-to-index
> omindex --db /my_db --follow --url /foo foo-to-index
> 
> If you have symlinks in the tree you don't want to follow, then bind
> mounts are another option (at least on Linux if you have root access).
> 
> You can also just search the databases together.  If you pass multiple
> DB parameters to omega, it'll search them together.  You can also pass
> DB parameters with a '/' in, which are split at the '/' into multiple
> DB names to search.
> 
>> The first alternative does not delete files deleted from the file system
>> from the database.  Is there any way around this except by emptying the
>> database and starting over?
> 
> Not as things are.  I think the "subsite" feature really needs
> overhauling - this '--no-delete' restriction means it's pretty much
> useless if you ever delete documents.
> 
> If you use the --url variant above, then each document is indexed by
> either term P/foo/A or term P/foo/B, so these terms can be used to limit
> the deletion of documents we didn't see in this run to exactly those
> which we should consider deleting.  Annoyingly that doesn't quite work
> in general though - if the --url setting contains a hostname, it's split
> off so if you index http://blog.example.org/ and
> http://shop.example.org/ then they'd both have P/ (and also
> Hblog.example.org or Hshop.example.org).  We could run a query to find
> the documents to delete, but that's significantly more costly than just
> iterating all documents or a single term.
> 
> Are people using the current subsites or other multiple run tricks with
> --no-delete?
> 
> If so, are you using the P terms as they currently are?  Is there
> anything we should try to make sure still works (or at least is still
> possible to achieve in a different way)?
> 
> If not, what prevents you from using them?
> 
> Cheers,
>     Olly
> 
Thanks Olly :-)

Sorry for the C/B mix-up -- you guessed correctly -- and thanks for the
full reply.

It may help if I explain what we want to do in less generic terms.  We
have a samba file server and the users want to be able to search files
in the shares.  The plan is to use omindex with omega for the user
interface.

There are several directories, say /foo/{A,B,C, ... Z}, each used as a
samba share, of which only some should be indexed and visible via omega.
For the users the /foo part is irrelevant while the {A,B,C, ... Z} are
meaningful.

I've never been clear about what goes on inside a Xapian database, but I
understand that the /foo prefix can be stripped off and the Apache/omega
configuration made to serve everything relative to /foo.
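
A minimal sketch of that per-share indexing, assuming one database per
share under a hypothetical /var/lib/omega/data directory and that only A
and C are to be indexed.  Passing --url /A while indexing /foo/A is what
keeps /foo out of the stored URLs.  The function only prints the omindex
commands, so they can be checked before anything is run:

```shell
# Hypothetical database directory; use whatever database_dir omega is configured with.
DB_ROOT="/var/lib/omega/data"

# Print the omindex command for one share (a dry run, nothing is indexed).
# --url /<share> makes stored URLs relative to /foo rather than including it.
index_cmd() {
    share=$1
    printf 'omindex --db %s/%s --url /%s /foo/%s\n' \
        "$DB_ROOT" "$share" "$share" "$share"
}

# Only the shares that should be searchable (assumed here to be A and C).
for share in A C; do
    index_cmd "$share"    # append "| sh" to actually run the command
done
```

Because each share gets its own database and --no-delete is not used,
each run can drop documents for files that have since been deleted.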

For this use case, running omega with several databases sounds perfect;
several user URLs could be set up, each searching a different subset of
{A,B,C, ... Z} to suit a particular user group -- Research, Engineering,
whatever.
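
As a rough illustration of such per-group URLs, the sketch below joins
database names with '/' in a single DB parameter, which, per the reply
quoted above, omega splits into multiple databases to search.  The host
name and CGI path are made up, and the query is assumed to be a single
word that needs no URL-encoding:

```shell
# Hypothetical omega endpoint; adjust to the local Apache configuration.
OMEGA_URL="http://fileserver.example/cgi-bin/omega"

# Build a search URL: first argument is the query, the rest are database names.
omega_search_url() {
    query=$1
    shift
    # Join the remaining arguments with '/' (IFS is changed only in the subshell).
    dbs=$(IFS=/; printf '%s' "$*")
    printf '%s?DB=%s&P=%s\n' "$OMEGA_URL" "$dbs" "$query"
}

# Example: a group that should search shares A and C together.
omega_search_url report A C
```

A group with access to other shares would simply get a URL built from a
different list of database names.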

In answer to your question, we do not use the --no-delete option.

Just out of curiosity, what is a "P term"?  Searching for it just finds
tickets and code.

Best, Charles




