[Xapian-discuss] Problem with stop words by indexing

Emmanuel Engelhart emmanuel at engelhart.org
Sun Jun 13 19:16:57 BST 2010


On 01/06/2010 16:18, Olly Betts wrote:
> On Thu, May 27, 2010 at 03:20:36PM +0200, emmanuel at engelhart.org wrote:
>> I have (in termgenerator_internal.cc, line 129) changed the default value of
>> stop_mode from STOPWORDS_INDEX_UNSTEMMED_ONLY to STOPWORDS_IGNORE and xapian
>> does now exactly what I want.
>>
>> Wouldn't it be possible to simply add a property "stopper_strategy" to the
>> termgenerator (or to the stopper) class and a method to modify it (like
>> set_stopper_strategy())?
> 
> Sure, want to work up a patch?
> 
> Cheers,
>     Olly

Here is a simple patch which adds a public "strategy" member to the
stopper classes.

If you do nothing, the indexer works as before. But you can now also set
stopper.strategy = Xapian::STOPWORDS_IGNORE, and the indexer will then
skip the unstemmed form of your stopwords as well, so they are not
indexed at all. This was not possible before, and a few people (like me)
need it.
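
To make the three modes concrete, here is a rough, self-contained sketch
of the decision made for each term. It does not use the real Xapian
classes: ToyStopper, toy_stem() and toy_index_text() are made-up
stand-ins, and I am assuming the strategy is checked at the same two
places where index_text() checks stop_mode. The toy stopper sets its
default strategy in its constructor so the member is never read
uninitialised.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Same three values as the enum in the patch.
enum stopper_strategy {
    STOPWORDS_NONE,
    STOPWORDS_IGNORE,
    STOPWORDS_INDEX_UNSTEMMED_ONLY
};

// Hypothetical stand-in for a stopper; NOT Xapian::SimpleStopper.
struct ToyStopper {
    stopper_strategy strategy;
    std::set<std::string> words;

    // Default keeps the old behaviour.
    ToyStopper() : strategy(STOPWORDS_INDEX_UNSTEMMED_ONLY) {}

    bool operator()(const std::string& term) const {
        return words.count(term) != 0;
    }
};

// Hypothetical stemmer: only strips a trailing "s".
std::string toy_stem(const std::string& term) {
    if (term.size() > 1 && term.back() == 's')
        return term.substr(0, term.size() - 1);
    return term;
}

// Returns the terms that would be added for a whitespace-separated text.
// Stemmed forms get a "Z" prefix, as the term generator does.
std::vector<std::string> toy_index_text(const std::string& text,
                                        const ToyStopper* stopper) {
    const stopper_strategy stop_mode =
        stopper ? stopper->strategy : STOPWORDS_NONE;

    std::vector<std::string> terms;
    std::istringstream in(text);
    std::string term;
    while (in >> term) {
        const bool is_stop = stopper && (*stopper)(term);

        // STOPWORDS_IGNORE: drop the stopword completely.
        if (stop_mode == STOPWORDS_IGNORE && is_stop) continue;

        terms.push_back(term);  // unstemmed form

        // STOPWORDS_INDEX_UNSTEMMED_ONLY: keep only the unstemmed form.
        if (stop_mode == STOPWORDS_INDEX_UNSTEMMED_ONLY && is_stop) continue;

        terms.push_back("Z" + toy_stem(term));  // stemmed form
    }
    return terms;
}
```

With "the" as the only stopword and the text "the cats", the default
strategy keeps "the" but not "Zthe", STOPWORDS_IGNORE keeps neither, and
STOPWORDS_NONE (or no stopper at all) keeps both.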

The diff file is located here:
http://tmp.kiwix.org/tmp/xapian-core-stopper-strategy.diff

It looks like this:
Index: queryparser/termgenerator_internal.cc
===================================================================
--- queryparser/termgenerator_internal.cc	(revision 14701)
+++ queryparser/termgenerator_internal.cc	(working copy)
@@ -117,19 +117,12 @@
     return 0;
 }

-// FIXME: add API for this:
-#define STOPWORDS_NONE 0
-#define STOPWORDS_IGNORE 1
-#define STOPWORDS_INDEX_UNSTEMMED_ONLY 2
-
 void
 TermGenerator::Internal::index_text(Utf8Iterator itor, termcount weight,
 				    const string & prefix, bool with_positions)
 {
-    int stop_mode = STOPWORDS_INDEX_UNSTEMMED_ONLY;
+    const stopper_strategy stop_mode = stopper ? stopper->strategy : STOPWORDS_NONE;

-    if (!stopper) stop_mode = STOPWORDS_NONE;
-
     while (true) {
 	// Advance to the start of the next term.
 	unsigned ch;
Index: include/xapian/queryparser.h
===================================================================
--- include/xapian/queryparser.h	(revision 14701)
+++ include/xapian/queryparser.h	(working copy)
@@ -32,12 +32,16 @@

 namespace Xapian {

+typedef enum { STOPWORDS_NONE, STOPWORDS_IGNORE, STOPWORDS_INDEX_UNSTEMMED_ONLY } stopper_strategy;
+
 class Database;
 class Stem;

 /// Base class for stop-word decision functor.
 class XAPIAN_VISIBILITY_DEFAULT Stopper {
   public:
+    stopper_strategy strategy;
+
     /// Is term a stop-word?
     virtual bool operator()(const std::string & term) const = 0;

@@ -54,7 +58,9 @@

   public:
     /// Default constructor.
-    SimpleStopper() { }
+    SimpleStopper() {
+      strategy = STOPWORDS_INDEX_UNSTEMMED_ONLY;
+    }

     /// Initialise from a pair of iterators.
 #ifndef __SUNPRO_CC

Regards
Emmanuel