[Xapian-discuss] index only the new files
James Aylett
james-xapian at tartarus.org
Tue Apr 24 11:48:48 BST 2007
On Tue, Apr 24, 2007 at 09:55:13AM +0000, iX Gamerz wrote:
> 1) I use omindex successfully with options like this:
>
> omindex --db /var/lib/xapian-omega/data/pdftagged/ --url /pdftagged
> /var/www/xapian/pdftagged_list/
>
> Is it possible to index only the new files recently copied, without
> reindexing everything from the beginning?
--duplicates ignore
should do what you want, provided you never update files: it'll
ignore anything already in the database, so a file that changes after
it has been indexed won't be picked up again. That may not be quite
what you want.
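To make the "never update files" caveat concrete, here is a toy model of duplicate handling. It is only a sketch, not omindex's code: the index is stood in for by a std::map keyed on URL, and ToyIndex, DuplicatePolicy and toy_index_file() are hypothetical names invented for the illustration.

```cpp
#include <ctime>
#include <map>
#include <string>

// Toy stand-in for the index: URL -> mtime recorded at indexing time.
// Hypothetical; a real Xapian database is not a map.
typedef std::map<std::string, std::time_t> ToyIndex;

// The two policies modelled here; the names are made up for this sketch.
enum DuplicatePolicy { DUP_IGNORE, DUP_REPLACE };

// Returns true if the file was (re)indexed, false if it was skipped.
bool toy_index_file(ToyIndex &index, const std::string &url,
                    std::time_t mtime, DuplicatePolicy policy)
{
    ToyIndex::iterator it = index.find(url);
    if (it != index.end()) {
        // Already indexed: "ignore" skips it even if the file has changed.
        if (policy == DUP_IGNORE) return false;
        // "replace" overwrites the stored entry.
        it->second = mtime;
        return true;
    }
    // New URL: always indexed, whatever the policy.
    index[url] = mtime;
    return true;
}
```

With DUP_IGNORE, a second call for the same URL returns false and the stored mtime stays at its old value, which is exactly the case where a modified file goes unnoticed.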
> 2) These files are copied into different folders where old files were
> already indexed.
>
> Is it possible to reindex only some of the folders?
>
> I can use a MySQL database to keep track of the newly added files, and I
> can record all the recently modified locations, but I don't understand how
> to use this information to index only small parts of the global database
> and keep the index up to date as fast as possible...
Currently there isn't a way of doing this. What we need is a small
change to omindex so it can take a list of files to
index/reindex. It's actually quite easy; there'd be two steps:
 (1) change the command line params to accept DIRECTORY... rather
     than a single DIRECTORY
 (2) indirect through index_fs_object() instead of index_directory(),
     which can stat each file first (but only at top level, so costing
     us almost nothing in the current usage)
I don't have time to do this myself right now, but (1) is a change to
the test at omindex.cc:793 followed by making omindex.cc:825 into a
loop; (2) is changing the omindex.cc:825 call (which will be a little
later by then) into a call to something like (completely untested, and
there should be some refactoring and I might have got some details
completely wrong :-):
----------------------------------------------------------------------
static void
index_fs_object(size_t depth_limit, const string &path,
                map<string, string>& mime_map)
{
    struct stat st;
    string file = root + indexroot + path;
    if (stat(file.c_str(), &st) != 0) {
        cout << "Could not work with " << path << ", skipping." << endl;
        return;
    }
    // The file type bits live in st_mode.
    if (S_ISDIR(st.st_mode)) {
        // Directories are handled as before.
        index_directory(depth_limit, path, mime_map);
    } else if (S_ISREG(st.st_mode)) {
        // Look up the MIME type from the file extension.
        string ext;
        string::size_type dot = path.find_last_of('.');
        if (dot != string::npos) ext = path.substr(dot + 1);
        map<string,string>::iterator mt = mime_map.find(ext);
        if (mt != mime_map.end()) {
            // It's in our MIME map so we know how to index it.
            const string & mimetype = mt->second;
            try {
                // Assumption: the document URL is indexroot + path.
                index_file(indexroot + path, mimetype,
                           st.st_mtime,
                           st.st_size);
            } catch (NoSuchFilter) {
                // FIXME: we ought to ignore by mime-type not extension.
                cout << "Filter for \"" << mimetype
                     << "\" not installed - ignoring extension \""
                     << ext << "\"" << endl;
                mime_map.erase(mt);
            }
        }
    }
}
----------------------------------------------------------------------
You'd need to use the extended base URI / base directory syntax, but I
think everyone should do that because it stops people thinking that
URIs and files are the same things ;-)
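For anyone wanting to experiment with the two steps outside omindex, here is a stand-alone sketch. Everything in it is hypothetical (FsKind, classify_path() and collect_targets() are not omindex names); it only shows the shape of the change: gather every trailing command line argument instead of one, then stat each one once to decide whether it would go to index_directory() or be indexed as a single file.

```cpp
#include <sys/stat.h>

#include <string>
#include <vector>

// What a top-level command line argument turned out to be.
enum FsKind { FS_MISSING, FS_DIRECTORY, FS_REGULAR, FS_OTHER };

// Step (2): classify one path with a single stat() call.
FsKind classify_path(const std::string &file)
{
    struct stat st;
    if (stat(file.c_str(), &st) != 0) return FS_MISSING;
    if (S_ISDIR(st.st_mode)) return FS_DIRECTORY;  // would recurse
    if (S_ISREG(st.st_mode)) return FS_REGULAR;    // would index directly
    return FS_OTHER;                               // sockets, devices, ...
}

// Step (1): accept DIRECTORY... by collecting every argument from
// position `first' onwards (in omindex that would be optind after
// option parsing).
std::vector<std::string> collect_targets(int argc, char **argv, int first)
{
    std::vector<std::string> targets;
    for (int i = first; i < argc; ++i) targets.push_back(argv[i]);
    return targets;
}
```

The caller would loop over the result of collect_targets() and dispatch on classify_path() for each entry, reporting and skipping anything that comes back as FS_MISSING or FS_OTHER.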
Alternatively you could use pdf2txt yourself directly, and use
scriptindex, but I suspect that's more work than is sensible in your
case.
J
--
/--------------------------------------------------------------------------\
James Aylett xapian.org
james at tartarus.org uncertaintydivision.org