[Xapian-discuss] Problem with stop words by indexing
Emmanuel Engelhart
emmanuel at engelhart.org
Sun Jun 13 19:16:57 BST 2010
On 01/06/2010 16:18, Olly Betts wrote:
> On Thu, May 27, 2010 at 03:20:36PM +0200, emmanuel at engelhart.org wrote:
>> I have (in termgenerator_internal.cc, line 129) changed the default value of
>> stop_mode from STOPWORDS_INDEX_UNSTEMMED_ONLY to STOPWORDS_IGNORE, and Xapian
>> now does exactly what I want.
>>
>> Wouldn't it be possible to simply add a "stopper_strategy" property to the
>> termgenerator (or to the stopper) class and a method to modify it (like
>> set_stopper_strategy())?
>
> Sure, want to work up a patch?
>
> Cheers,
> Olly
So here is a simple patch which adds a public strategy property
to the stopper classes.
If you do nothing, the indexer works as before. But you can also set
stopper.strategy = Xapian::STOPWORDS_IGNORE, and the indexer will then
skip your stopwords entirely instead of indexing their unstemmed form.
This was not possible before and is needed by a few people (like me).
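To make the difference between the three strategies concrete, here is a rough stand-alone sketch. It is a toy model, not the patch and not Xapian code: ToyStopper, toy_stem and index_text are made-up names, whitespace splitting stands in for the real tokeniser, and the behaviour given to each strategy is my reading of how index_text() in termgenerator_internal.cc uses stop_mode. Stemmed terms get a "Z" prefix, as they do in Xapian.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Same three values as the enum proposed in the patch.
enum stopper_strategy {
    STOPWORDS_NONE,
    STOPWORDS_IGNORE,
    STOPWORDS_INDEX_UNSTEMMED_ONLY
};

// Hypothetical stand-in for Xapian::SimpleStopper.
// Assumption of this sketch: the default is set on the member itself.
struct ToyStopper {
    std::set<std::string> words;
    stopper_strategy strategy = STOPWORDS_INDEX_UNSTEMMED_ONLY;

    bool operator()(const std::string& term) const {
        return words.count(term) != 0;
    }
};

// Hypothetical stand-in for a stemmer: strips one trailing 's'.
std::string toy_stem(const std::string& term) {
    if (term.size() > 1 && term.back() == 's')
        return term.substr(0, term.size() - 1);
    return term;
}

// Returns the terms that would be added to the document.
std::vector<std::string> index_text(const std::string& text,
                                    const ToyStopper* stopper) {
    const stopper_strategy stop_mode =
        stopper ? stopper->strategy : STOPWORDS_NONE;

    std::vector<std::string> terms;
    std::istringstream in(text);
    std::string term;
    while (in >> term) {
        const bool is_stop = stopper && (*stopper)(term);

        // STOPWORDS_IGNORE: a stopword produces no term at all.
        if (stop_mode == STOPWORDS_IGNORE && is_stop) continue;

        // Unstemmed form.
        terms.push_back(term);

        // STOPWORDS_INDEX_UNSTEMMED_ONLY: no stemmed form for a stopword.
        if (stop_mode == STOPWORDS_INDEX_UNSTEMMED_ONLY && is_stop) continue;

        // Stemmed form, marked with the "Z" prefix.
        terms.push_back("Z" + toy_stem(term));
    }
    return terms;
}
```

With "the" as the only stopword, indexing "the cats" in this model gives the, cats, Zcat under the default strategy, and only cats, Zcat once strategy is set to STOPWORDS_IGNORE.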
The diff file is located here:
http://tmp.kiwix.org/tmp/xapian-core-stopper-strategy.diff
It looks like this:
Index: queryparser/termgenerator_internal.cc
===================================================================
--- queryparser/termgenerator_internal.cc (revision 14701)
+++ queryparser/termgenerator_internal.cc (working copy)
@@ -117,19 +117,12 @@
return 0;
}
-// FIXME: add API for this:
-#define STOPWORDS_NONE 0
-#define STOPWORDS_IGNORE 1
-#define STOPWORDS_INDEX_UNSTEMMED_ONLY 2
-
void
TermGenerator::Internal::index_text(Utf8Iterator itor, termcount weight,
const string & prefix, bool with_positions)
{
- int stop_mode = STOPWORDS_INDEX_UNSTEMMED_ONLY;
+ const stopper_strategy stop_mode = stopper ? stopper->strategy : STOPWORDS_NONE;
- if (!stopper) stop_mode = STOPWORDS_NONE;
-
while (true) {
// Advance to the start of the next term.
unsigned ch;
Index: include/xapian/queryparser.h
===================================================================
--- include/xapian/queryparser.h (revision 14701)
+++ include/xapian/queryparser.h (working copy)
@@ -32,12 +32,16 @@
namespace Xapian {
+typedef enum { STOPWORDS_NONE, STOPWORDS_IGNORE, STOPWORDS_INDEX_UNSTEMMED_ONLY } stopper_strategy;
+
class Database;
class Stem;
/// Base class for stop-word decision functor.
class XAPIAN_VISIBILITY_DEFAULT Stopper {
public:
+ stopper_strategy strategy;
+
/// Is term a stop-word?
virtual bool operator()(const std::string & term) const = 0;
@@ -54,7 +58,9 @@
public:
/// Default constructor.
- SimpleStopper() { }
+ SimpleStopper() {
+ strategy = STOPWORDS_INDEX_UNSTEMMED_ONLY;
+ }
/// Initialise from a pair of iterators.
#ifndef __SUNPRO_CC
Regards
Emmanuel