[Xapian-discuss] How to omindex some sub-directories?

Olly Betts olly at survex.com
Thu May 16 22:59:19 BST 2013


On Wed, May 15, 2013 at 12:05:01PM +0530, Charles wrote:
> Given a directory tree like ...
> 
> /foo
> |
> +-- A
> |
> +-- B
> |
> +-- C
> 
> ... what is the best way to index A and C into a single Xapian database?

I guess you mean "A and B" (or "C" not "B" below)...

> AFAIK the alternatives are:
> 
> omindex --db /my_db --no-delete /foo /foo/A
> omindex --db /my_db --no-delete /foo /foo/B

I think it is better to use --url (though I find the subsite stuff
confusing, so I may be misunderstanding the plan behind it):

omindex --db /my_db --no-delete --url /foo/A /foo/A
omindex --db /my_db --no-delete --url /foo/B /foo/B

> or
> 
> omindex --db /my_A_db /foo /foo/A
> omindex --db /my_B_db /foo /foo/B
> xapian-compact /my_A_db /my_B_db /my_db

Another approach is to "fake up" the tree you want to index with
symlinks and then index that - e.g.:

mkdir foo-to-index
ln -s /foo/A /foo/B foo-to-index
omindex --db /my_db --follow --url /foo foo-to-index

If you have symlinks in the tree you don't want to follow, then bind
mounts are another option (at least on Linux if you have root access).
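
A rough sketch of the bind-mount variant, for illustration only.  The
staging path /srv/foo-to-index and the helper name bind_plan are made up
for this example.  The helper only prints the commands, so they can be
reviewed before being run as root, and it assumes paths without spaces.

```shell
#!/bin/sh
# Print (do not run) the commands that would build a staging tree out of
# bind mounts.  The staging path and helper name are illustrative
# assumptions; paths are assumed to contain no spaces.
bind_plan() {
    staging=$1
    shift
    for src in "$@"; do
        name=$(basename "$src")
        # One mount point per source directory, then the bind mount itself.
        printf 'mkdir -p %s/%s\n' "$staging" "$name"
        printf 'mount --bind %s %s/%s\n' "$src" "$staging" "$name"
    done
}

bind_plan /srv/foo-to-index /foo/A /foo/B
```

Once the mounts are in place, the staging directory can be indexed
without --follow (for example: omindex --db /my_db --url /foo
/srv/foo-to-index), and the mounts removed afterwards with umount.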

You can also just search the databases together.  If you pass multiple
DB parameters to omega, it'll search them together.  You can also pass
a DB parameter containing a '/', which is split at each '/' into
multiple DB names to search.
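
To make the two forms concrete, here is a small sketch that builds the
query string either way.  DB and P are omega's CGI parameters for the
database and the query text; the helper name omega_query and the
database names my_A_db and my_B_db are only for illustration, and are
assumed to name databases under the database_dir set in omega.conf.

```shell
#!/bin/sh
# Build an omega query string that searches several databases together.
# style "repeat" emits one DB parameter per database; style "slash"
# emits a single DB parameter with the names joined by '/'.
# The query text is assumed to be a single word needing no URL-encoding.
omega_query() {
    style=$1
    terms=$2
    shift 2
    qs=
    case $style in
        repeat)
            for db in "$@"; do
                qs="${qs}DB=${db}&"
            done
            ;;
        slash)
            # "$*" joins the arguments with the first character of IFS.
            joined=$(IFS=/; printf '%s' "$*")
            qs="DB=${joined}&"
            ;;
        *)
            echo "unknown style: $style" >&2
            return 1
            ;;
    esac
    printf '%sP=%s\n' "$qs" "$terms"
}

omega_query repeat foo my_A_db my_B_db
omega_query slash foo my_A_db my_B_db
```

Either string would be appended after the '?' of whatever URL your
omega CGI is installed at.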

> The first alternative does not delete files deleted from the file system
> from the database.  Is there any way around this except by emptying the
> database and starting over?

Not as things are.  I think the "subsite" feature really needs
overhauling - this '--no-delete' restriction means it's pretty much
useless if you ever delete documents.

If you use the --url variant above, then each document is indexed by
either term P/foo/A or term P/foo/B, so these terms can be used to limit
the deletion of documents we didn't see in this run to exactly those
which we should consider deleting.  Annoyingly that doesn't quite work
in general though - if the --url setting contains a hostname, it's split
off so if you index http://blog.example.org/ and
http://shop.example.org/ then they'd both have P/ (and also
Hblog.example.org or Hshop.example.org).  We could run a query to find
the documents to delete, but that's significantly more costly than just
iterating over all documents or over a single term.
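
The collision is easy to see with a simplified model of the split
described above.  This is only a sketch of the behaviour as described in
this message, not omindex's actual code; the function name url_terms is
made up, and any normalisation omindex applies to the URL is ignored.

```shell
#!/bin/sh
# Simplified model: a --url value with a scheme and hostname yields a P
# term for the path plus an H term for the host; a plain path yields
# only a P term.
url_terms() {
    url=$1
    case $url in
        *://*)
            rest=${url#*://}      # drop the scheme
            host=${rest%%/*}      # everything up to the first '/'
            case $rest in
                */*) path=/${rest#*/} ;;
                *)   path=/ ;;    # no path given at all
            esac
            printf 'P%s H%s\n' "$path" "$host"
            ;;
        *)
            printf 'P%s\n' "$url"
            ;;
    esac
}

url_terms /foo/A
url_terms /foo/B
url_terms http://blog.example.org/
url_terms http://shop.example.org/
```

The two local paths get distinct P terms, while both example.org sites
end up with the same P/ term and differ only in their H term, which is
why a single P term isn't enough to scope the deletion there.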

Are people using the current subsites or other multiple run tricks with
--no-delete?

If so, are you using the P terms as they currently are?  Is there
anything we should try to make sure still works (or at least is still
possible to achieve in a different way)?

If not, what prevents you from using them?

Cheers,
    Olly


