[Xapian-discuss] index only the new files

Tue Apr 24 11:48:48 BST 2007

On Tue, Apr 24, 2007 at 09:55:13AM +0000, iX Gamerz wrote:

> 1) I use Omindex with success with some options like this :
> 
> omindex --db /var/lib/xapian-omega/data/pdftagged/ --url /pdftagged
> /var/www/xapian/pdftagged_list/
> 
> Is it possible to index only the new files recently copied without
> reindexing everything from the beginning?

--duplicates ignore

should do what you want, provided you never update files: omindex will
ignore anything that is already in the database. If files can change
after they've been indexed, this isn't quite what you want, because
the changes won't be picked up.
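
To make the caveat concrete, here is a toy model of the skip-if-present
rule. It is not omindex's code: the "database" is just a set of URLs,
and urls_to_index() and DuplicatePolicy are names made up for this
illustration.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy model only: the "database" is the set of URLs already indexed.
// Ignore skips known URLs; Replace models reprocessing every file found.
enum class DuplicatePolicy { Ignore, Replace };

// Returns the URLs that would actually be (re)indexed on this run.
std::vector<std::string>
urls_to_index(const std::set<std::string>& already_indexed,
              const std::vector<std::string>& found_on_disk,
              DuplicatePolicy policy)
{
    std::vector<std::string> work;
    for (const std::string& url : found_on_disk) {
        bool known = already_indexed.count(url) != 0;
        // With Ignore, a known URL is skipped even if the file behind
        // it has changed since it was indexed.
        if (known && policy == DuplicatePolicy::Ignore) continue;
        work.push_back(url);
    }
    return work;
}
```

With two URLs already in the set and a third one new on disk, Ignore
yields only the new one. An edited copy of an already-known file is
skipped as well, which is the catch.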

> 2) These files are copied into different folders where old files were
> already indexed.
> 
> Is it possible to reindex only some of the folders?
> 
> I can use a MySQL database to keep track of the newly added files, and I can
> record all the recently modified locations, but I don't understand how to use
> this information to index only small parts of the global database and so
> keep the index up to date as fast as possible...

Currently there isn't a way of doing this. What we need is a small
change to omindex so it can take a list of files to
index/reindex. It's actually quite easy; there'd be two steps:

 (1) change the command-line parameters to accept DIRECTORY... (one or
     more) instead of a single DIRECTORY

 (2) go through a new index_fs_object() instead of calling
     index_directory() directly, so each path can be stat()ed first
     (only at the top level, so it costs us almost nothing in the
     current usage)
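
As a rough, standalone illustration of those two steps (not a patch
against omindex.cc): loop over every remaining command-line argument,
and stat() each one to decide how it would be dispatched. Kind,
classify() and plan_targets() are hypothetical names, and the real
code would call the indexing functions rather than return a plan.

```cpp
#include <string>
#include <sys/stat.h>
#include <utility>
#include <vector>

enum class Kind { Missing, Directory, RegularFile, Other };

// stat() the path once and report what it is.
Kind classify(const std::string& path)
{
    struct stat st;
    if (stat(path.c_str(), &st) != 0) return Kind::Missing;
    if (S_ISDIR(st.st_mode)) return Kind::Directory;
    if (S_ISREG(st.st_mode)) return Kind::RegularFile;
    return Kind::Other;
}

// Step (1): treat every argument from argv[first] onwards as a target.
// Step (2): classify each target before deciding how to handle it.
std::vector<std::pair<std::string, Kind>>
plan_targets(int argc, const char* const* argv, int first)
{
    std::vector<std::pair<std::string, Kind>> plan;
    for (int i = first; i < argc; ++i) {
        plan.emplace_back(argv[i], classify(argv[i]));
    }
    return plan;
}
```

A directory would then go to the recursive walk and a regular file
straight to the per-file indexer; anything missing gets reported and
skipped. The cost is one extra stat() per argument.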

I don't have time to do this myself right now, but (1) is a change to
the test at omindex.cc:793, followed by turning the call at
omindex.cc:825 into a loop; (2) is changing that omindex.cc:825 call
(which will be a little further down by then) into a call to something
like the following. This is completely untested, some refactoring is
due, and I might have got some details completely wrong :-)

----------------------------------------------------------------------
static void
index_fs_object(size_t depth_limit, const string &path,
                map<string, string>& mime_map)
{
    struct stat st;
    string file = root + indexroot + path;
    if (stat(file.c_str(), &st)) {
        cout << "Could not work with " << path << ", skipping." << endl;
        return;
    }
    if (S_ISDIR(st.st_mode)) {
        index_directory(depth_limit, path, mime_map);
    } else if (S_ISREG(st.st_mode)) {
        string ext;
        string::size_type dot = path.find_last_of('.');
        if (dot != string::npos) ext = path.substr(dot + 1);

        map<string,string>::iterator mt = mime_map.find(ext);
        if (mt != mime_map.end()) {
            // It's in our MIME map so we know how to index it.
            const string & mimetype = mt->second;
            try {
                // Assumes index_file() wants the path relative to root.
                index_file(indexroot + path, mimetype,
                           st.st_mtime,
                           st.st_size);
            } catch (NoSuchFilter) {
                // FIXME: we ought to ignore by mime-type not extension.
                cout << "Filter for \"" << mimetype
                     << "\" not installed - ignoring extension \""
                     << ext << "\"" << endl;
                mime_map.erase(mt);
            }
        }
    }
}
----------------------------------------------------------------------

You'd need to use the extended base URI / base directory syntax, but I
think everyone should do that because it stops people thinking that
URIs and files are the same things ;-)
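
To spell out that distinction, here is a minimal sketch of keeping the
two namespaces apart. It assumes a document's file path and its URL
are each built from a separate base plus a shared relative path;
Location, locate() and the parameter names are hypothetical and only
loosely mirror the variables in the code above.

```cpp
#include <string>

// Hypothetical names for illustration:
//   base_dir - filesystem directory that corresponds to base_url
//   base_url - URL prefix under which those files are served
//   rel_path - path of one document relative to base_dir, with leading '/'
struct Location {
    std::string file;  // where to read the document from
    std::string url;   // what to record in the index
};

Location locate(const std::string& base_dir, const std::string& base_url,
                const std::string& rel_path)
{
    // The same relative path is appended to two different bases.
    return Location{base_dir + rel_path, base_url + rel_path};
}
```

Only the relative part is shared, so the filesystem layout and the URL
layout can differ freely above it.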

Alternatively you could use pdf2txt yourself directly, and use
scriptindex, but I suspect that's more work than is sensible in your
case.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org