[james@tartarus.org: Re: [Xapian-discuss] incremental indexing]

James Aylett james at tartarus.org
Wed Jul 20 18:17:03 BST 2005


Sigh. I can't remember the list address ...

J

----- Forwarded message from James Aylett <james at tartarus.org> -----

Date: Wed, 20 Jul 2005 18:05:53 +0100
To: Arshavir Grigorian <ag at m-cam.com>, xapian-discuss at tartarus.org
From: James Aylett <james at tartarus.org>
Subject: Re: [Xapian-discuss] incremental indexing
Mail-Followup-To: Arshavir Grigorian <ag at m-cam.com>,
	xapian-discuss at tartarus.org

On Wed, Jul 20, 2005 at 10:55:30AM -0400, Arshavir Grigorian wrote:

> I am very new to Xapian and would not have been able to generate the 
> dump. So thanks for the script.

Note that it's not the preferred way of looping over all documents,
but it works well enough in this case.
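For reference, the usual way to walk every document in a Xapian database is to iterate the posting list for the empty term, which yields one entry per document. A minimal Python sketch of that idiom (duck-typed so it runs without the bindings; with the real bindings you would pass a xapian.Database instance):

```python
def all_docids(db):
    """Yield every document id in the database, in docid order.

    Iterating db.postlist("") - the posting list for the empty
    term - visits every document; each posting carries its docid.
    """
    for posting in db.postlist(""):  # empty term == all documents
        yield posting.docid
```

Any object exposing a compatible `postlist()` works here; in practice you would open the database with `xapian.Database(path)` and pass that in.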

> I installed the bindings and ran the script after the first command, 
> after the second (on a clean database) as well as after running the 
> first and second commands in sequence. With a quick diff it looks like 
> running the second command both on a clean db and after the first 
> command generated the same results.

This would be the case if it's getting rid of any documents created by
the first run (well, the document ids will be different).

> Attached are the results from your script as well as the exact 
> commands I used:
> 
> omindex --db /var/lib/omega/data/default --url '/pdf' /[path]/ 3915
> 
> omindex --db /var/lib/omega/data/default --url '/pdf' /[path]/ 3916
> (after cleaning the db path - rm /var/lib/omega/data/default/*).

Cool, thanks. It wasn't what I thought it might be, but that did push
me to check the source code :-).

Basically, there's a bug in omindex (effectively) that means that
incremental operation currently isn't supported, despite what the
overview document says.

I've attached an untested patch which should help here. You have to
pass a new command line switch, -p, to get the behaviour you want
(docs updated in the patch also).
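The heart of the patch is a guard on omindex's clean-up pass: after indexing, it walks its `updated` bitmap and deletes any document it didn't touch, unless the new switch is given. A rough Python paraphrase of that logic (names mirror the C++; `delete_document` here is just a hypothetical callback standing in for the Xapian delete call):

```python
def prune_unupdated(updated, delete_document,
                    skip_duplicates=False, preserve_unupdated=False):
    """Paraphrase of omindex's post-indexing clean-up loop.

    updated: sequence indexed by docid, where updated[did] is true if
    this run (re)indexed document did.  Index 0 is unused, since
    Xapian docids start at 1.
    Returns the list of docids that were deleted.
    """
    if skip_duplicates or preserve_unupdated:
        return []  # '-d ignore' mode, or the new -p switch: keep everything
    deleted = []
    for did in range(1, len(updated)):
        if not updated[did]:        # not seen this run: assume it's gone
            delete_document(did)
            deleted.append(did)
    return deleted
```

This makes the behaviour described in the patched docs concrete: with `-p` set, documents from earlier omindex runs against the same database survive untouched.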

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org

Index: omindex.cc
===================================================================
--- omindex.cc	(revision 6331)
+++ omindex.cc	(working copy)
@@ -2,7 +2,7 @@
  *
  * ----START-LICENCE----
  * Copyright 1999,2000,2001 BrightStation PLC
- * Copyright 2001 James Aylett
+ * Copyright 2001,2005 James Aylett
  * Copyright 2001,2002 Ananova Ltd
  * Copyright 2002,2003,2004,2005 Olly Betts
  *
@@ -579,6 +579,9 @@
     // If overwrite is true, the database will be created anew even if it
     // already exists.
     bool overwrite = false;
+    // If preserve_unupdated is false, delete any documents we don't
+    // replace (if in replace duplicates mode)
+    bool preserve_unupdated = false;
     size_t depth_limit = 0;
 
     static const struct option longopts[] = {
@@ -586,6 +589,7 @@
 	{ "version",	no_argument,		NULL, 'v' },
 	{ "overwrite",	no_argument,		NULL, 'o' },
 	{ "duplicates",	required_argument,	NULL, 'd' },
+	{ "preserve-nonduplicates",	no_argument,	NULL, 'p' },
 	{ "db",		required_argument,	NULL, 'D' },
 	{ "url",	required_argument,	NULL, 'U' },
 	{ "mime-type",	required_argument,	NULL, 'M' },
@@ -631,7 +635,7 @@
     mime_map["pm"] = "text/x-perl";
     mime_map["pod"] = "text/x-perl";
 
-    while ((getopt_ret = gnu_getopt_long(argc, argv, "hvd:D:U:M:l", longopts, NULL))!=EOF) {
+    while ((getopt_ret = gnu_getopt_long(argc, argv, "hvd:D:U:M:lp", longopts, NULL))!=EOF) {
 	switch (getopt_ret) {
 	case 'h':
 	    cout << OMINDEX << endl
@@ -639,6 +643,9 @@
 		 << "\t--url BASEURL [BASEDIRECTORY] DIRECTORY\n\n"
 		 << "Index static website data via the filesystem.\n"
 		 << "  -d, --duplicates\tset duplicate handling ('ignore' or 'replace')\n"
+	         << "  -p, --preserve-nonduplicates\n"
+		"\t\t\tdon't delete unupdated documents in\n"
+		"\t\t\tduplicate replace mode\n"
 		 << "  -D, --db\t\tpath to database to use\n"
 		 << "  -U, --url\t\tbase url DIRECTORY represents\n"
 	         << "  -M, --mime-type\tadditional MIME mapping ext:type\n"
@@ -654,7 +661,7 @@
 	case 'v':
 	    cout << OMINDEX << " (" << PACKAGE << ") " << VERSION << "\n"
 		 << "Copyright (c) 1999,2000,2001 BrightStation PLC.\n"
-		 << "Copyright (c) 2001 James Aylett\n"
+		 << "Copyright (c) 2001,2005 James Aylett\n"
 		 << "Copyright (c) 2001,2002 Ananova Ltd\n"
 		 << "Copyright (c) 2002,2003,2004,2005 Olly Betts\n\n"
 		 << "This is free software, and may be redistributed under\n"
@@ -670,6 +677,9 @@
 		break;
 	    }
 	    break;
+	case 'p': // don't delete unupdated documents
+	    preserve_unupdated = true;
+	    break;
 	case 'l': { // Set recursion limit
 	    int arg = atoi(optarg);
 	    if (arg < 0) arg = 0;
@@ -757,7 +767,7 @@
 	    db = Xapian::WritableDatabase(dbpath, Xapian::DB_CREATE_OR_OVERWRITE);
 	}
 	index_directory(depth_limit, "/", mime_map);
-	if (!skip_duplicates) {
+	if (!skip_duplicates && !preserve_unupdated) {
 	    for (Xapian::docid did = 1; did < updated.size(); ++did) {
 		if (!updated[did]) {
 		    try {
Index: docs/overview.txt
===================================================================
--- docs/overview.txt	(revision 6331)
+++ docs/overview.txt	(working copy)
@@ -117,6 +117,9 @@
 
 Index static website data via the filesystem.
   -d, --duplicates      set duplicate handling ('ignore' or 'replace')
+  -p, --preserve-nonduplicates
+		        don't delete documents not updated during
+			replace duplicate operation
   -D, --db              path to database to use
   -U, --url             base url DIRECTORY represents
   -M, --mime-type	additional MIME mapping ext:type
@@ -154,17 +157,17 @@
 passes, one for the '/press' site and one for the '/product' site. You
 might use the following commands:
 
-$ omindex --db /www/omega --url '/press' /www/example/press
-$ omindex --db /www/omega --url '/product' /www/example/product
+$ omindex -p --db /www/omega --url '/press' /www/example/press
+$ omindex -p --db /www/omega --url '/product' /www/example/product
 
 If you add a new large product, but don't want to reindex the whole of
 the product section, you could do:
 
-$ omindex --db /www/omega --url '/product' /www/example/product large
+$ omindex -p --db /www/omega --url '/product' /www/example/product large
 
 and just the large products will be reindexed. You need to do it like that, and not as:
 
-$ omindex --db /www/omega --url '/product/large' /www/example/product/large
+$ omindex -p --db /www/omega --url '/product/large' /www/example/product/large
 
 because that would make the large products part of a new site,
 '/product/large', which is unlikely to be what you want, as large
@@ -228,6 +231,17 @@
 completely static documents (eg: archive sites), while 'replace' is
 the most generally useful.
 
+With 'replace', omindex will remove any document it finds in the
+database that it did not update - in other words, it will clear out
+everything that doesn't exist any more. However if you are building up
+an omega database with several runs of omindex, this is not
+appropriate (as each run would delete the data from the previous run),
+so you should use the --preserve-nonduplicates. Note that if you
+choose to work like this, it is impossible to prune old documents from
+the database using omindex. If this is a problem for you, an
+alternative is to index each subsite into a different database, and
+merge all the databases together when searching.
+
 --depth-limit allows you to prevent omindex from descending more than
 a certain number of directories.  If you wish to replicate the old
 --no-recurse option, use ----depth-limit=1.


----- End forwarded message -----

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org

