[Xapian-discuss] Size of the index
Justine Demeyer
justine.demeyer at gmail.com
Tue Nov 25 18:34:55 GMT 2008
More precisely, I get an error saying that I can't pass a SimpleStopper as
the argument to set_stopper....
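
A likely cause, going by the API documentation: set_stopper() takes a pointer
to a stopper (const Xapian::Stopper *), not a stopper object, so the call
would need to be indexer.set_stopper(&stop), and stop has to stay alive for as
long as the indexer uses it. The sketch below only illustrates that
pointer-versus-object difference. Everything in namespace sketch is a
hypothetical stand-in that mirrors the shape of the Xapian classes so the
example compiles without Xapian installed; is_stopped() and demo_is_stopped()
are made-up helpers, not part of Xapian.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical stand-ins, NOT the real Xapian classes. They only mirror
// the shape of the API so this compiles without Xapian installed.
namespace sketch {

class Stopper {
public:
    virtual ~Stopper() {}
    // Returns true if `term` is a stop word.
    virtual bool operator()(const std::string& term) const = 0;
};

class SimpleStopper : public Stopper {
    std::set<std::string> words;
public:
    void add(const std::string& word) { words.insert(word); }
    bool operator()(const std::string& term) const {
        return words.count(term) != 0;
    }
};

class TermGenerator {
    const Stopper* stopper;  // pointer is stored, the stopper is not copied
public:
    TermGenerator() : stopper(NULL) {}
    // Takes a pointer, so passing a SimpleStopper object does not compile.
    void set_stopper(const Stopper* stop = NULL) { stopper = stop; }
    // Made-up helper for this sketch: would `term` be dropped?
    bool is_stopped(const std::string& term) const {
        return stopper != NULL && (*stopper)(term);
    }
};

}  // namespace sketch

// Made-up helper: builds a stopper containing "the" and checks one term.
bool demo_is_stopped(const std::string& term) {
    sketch::SimpleStopper stop;
    stop.add("the");
    sketch::TermGenerator indexer;
    // indexer.set_stopper(stop);  // error: SimpleStopper is not a pointer
    indexer.set_stopper(&stop);    // OK: pass the address of the stopper
    return indexer.is_stopped(term);
}
```

With the real library, the only change to the snippet quoted below would be
the last line: indexer.set_stopper(&stop);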
2008/11/25 Justine Demeyer <justine.demeyer at gmail.com>
> Yes, I tried it but it doesn't work.
>
> I tried an example like this:
>
> Xapian::WritableDatabase db(ind, Xapian::DB_CREATE_OR_OPEN);
> Xapian::TermGenerator indexer;
> Xapian::Stem stemmer("english");
> Xapian::SimpleStopper stop;
> stop.add("the");
> indexer.set_stemmer(stemmer);
> indexer.set_stopper(stop);
>
>
> 2008/11/25 Robert Young <rob at roryoung.co.uk>
>
>> You could use a SimpleStopper
>>
>> http://xapian.org/docs/apidoc/html/classXapian_1_1SimpleStopper.html
>>
>> On Tue, Nov 25, 2008 at 3:55 PM, Justine Demeyer
>> <justine.demeyer at gmail.com> wrote:
>>
>> > Thanks for your help, but I don't know how to use these stop words. I saw
>> > that I have to add indexer.set_stopper() to my file, but what do I have to
>> > put between the ()?
>> >
>> > Thanks
>> >
>> > 2008/11/25 Robert Young <rob at roryoung.co.uk>
>> >
>> > > Oops, xapian-discuss doesn't seem to set reply-to.
>> > >
>> > > Stop words are words that appear in such a high proportion of the
>> > > documents in your corpus that they can be safely ignored: words like
>> > > 'the', 'and', 'a', etc. Remove these and you can improve the precision of
>> > > your queries and the performance of both queries and indexing, and reduce
>> > > the size of your index, at the potential expense of recall.
>> > >
>> > > Cheers
>> > > Rob
>> > >
>> > > On Tue, Nov 25, 2008 at 2:23 PM, Justine Demeyer
>> > > <justine.demeyer at gmail.com> wrote:
>> > >
>> > > >
>> > > > Ok, thanks!!
>> > > >
>> > > > But what is the purpose of the stop words??
>> > > >
>> > > >
>> > > > 2008/11/25 Robert Young <rob at roryoung.co.uk>
>> > > >
>> > > >> As Henry alluded to earlier, you could potentially reduce the size of
>> > > >> your index by removing stop words.
>> > > >>
>> > > >> Cheers
>> > > >> Rob
>> > > >>
>> > > >>
>> > > >> On Tue, Nov 25, 2008 at 10:32 AM, Justine Demeyer <
>> > > >> justine.demeyer at gmail.com> wrote:
>> > > >>
>> > > >>> Here is the indexing code:
>> > > >>>
>> > > >>> void Index(char* ind, char* directory)
>> > > >>> {
>> > > >>>     try
>> > > >>>     {
>> > > >>>         timeval tim;
>> > > >>>         double t1, t2, dif;
>> > > >>>
>> > > >>>         string index(ind);
>> > > >>>
>> > > >>>         // Start time of the operation
>> > > >>>         gettimeofday(&tim, NULL);
>> > > >>>         t1 = tim.tv_sec + (tim.tv_usec / 1000000.0);
>> > > >>>
>> > > >>>         // Create or open the index
>> > > >>>         Xapian::WritableDatabase db(ind, Xapian::DB_CREATE_OR_OPEN);
>> > > >>>         Xapian::TermGenerator indexer;
>> > > >>>         Xapian::Stem stemmer("english");
>> > > >>>         indexer.set_stemmer(stemmer);
>> > > >>>
>> > > >>>         struct dirent *lecture;
>> > > >>>         DIR *rep;
>> > > >>>
>> > > >>>         rep = opendir(directory);
>> > > >>>         if (rep == NULL)
>> > > >>>         {
>> > > >>>             // Nothing to index if the directory cannot be opened
>> > > >>>             cout << "Cannot open directory " << directory << endl;
>> > > >>>             return;
>> > > >>>         }
>> > > >>>
>> > > >>>         while ((lecture = readdir(rep)))
>> > > >>>         {
>> > > >>>             char* name = lecture->d_name;
>> > > >>>             std::string name2(name);
>> > > >>>
>> > > >>>             // Skip the "." and ".." directory entries
>> > > >>>             if (name2 == "." || name2 == "..")
>> > > >>>                 continue;
>> > > >>>
>> > > >>>             // Assumes that directory ends with a '/'
>> > > >>>             string path = directory + name2;
>> > > >>>
>> > > >>>             ifstream fichier(path.c_str(), ios::in);
>> > > >>>
>> > > >>>             if (fichier) // this test fails if the file could not be opened
>> > > >>>             {
>> > > >>>                 string ligne; // holds each line read
>> > > >>>                 string contenu;
>> > > >>>
>> > > >>>                 // this loop stops at end of file or as soon as a
>> > > >>>                 // read error occurs
>> > > >>>                 while (std::getline(fichier, ligne))
>> > > >>>                 {
>> > > >>>                     contenu = contenu + ligne + "\n";
>> > > >>>                 }
>> > > >>>
>> > > >>>                 // Indexing
>> > > >>>                 Xapian::Document doc;
>> > > >>>                 doc.set_data(contenu);
>> > > >>>
>> > > >>>                 indexer.set_document(doc);
>> > > >>>                 indexer.index_text(contenu);
>> > > >>>
>> > > >>>                 db.add_document(doc);
>> > > >>>                 cout << "add " << path.c_str() << endl;
>> > > >>>             }
>> > > >>>         }
>> > > >>>
>> > > >>>         // Commit the changes
>> > > >>>         cout << "Optimizing" << endl;
>> > > >>>         db.flush();
>> > > >>>         closedir(rep);
>> > > >>>
>> > > >>>         // End time of the operation
>> > > >>>         gettimeofday(&tim, NULL);
>> > > >>>         t2 = tim.tv_sec + (tim.tv_usec / 1000000.0);
>> > > >>>
>> > > >>>         // Compute the duration of the operation
>> > > >>>         dif = t2 - t1;
>> > > >>>         Calculate(dif);
>> > > >>>     }
>> > > >>>     catch (const Xapian::Error &e)
>> > > >>>     {
>> > > >>>         cout << e.get_description() << endl;
>> > > >>>     }
>> > > >>> }
>> > > >>>
>> > > >>> Thanks for helping me
>> > > >>>
>> > > >>>
>> > > >>> 2008/11/25 Henry <henka at cityweb.co.za>
>> > > >>>
>> > > >>> > Quoting "Justine Demeyer" <justine.demeyer at gmail.com>:
>> > > >>> > > I have a question about the size of the Xapian index.
>> > > >>> > >
>> > > >>> > > I indexed a set of 200,000 documents with a total size of about 1 GB,
>> > > >>> > > and the resulting index is more than 3 GB! What can explain this
>> > > >>> > > difference?
>> > > >>> >
>> > > >>> > You'll find this with all indexing systems, to some degree. The size
>> > > >>> > of your index is almost always larger than the raw text, depending on
>> > > >>> > how you've structured the index/terms, whether you're stopalizing,
>> > > >>> > etc, and also depends on whether you've compacted the DB.
>> > > >>> >
>> > > >>> > If you post more detail about your index then that will help to
>> > > >>> > pinpoint why your index is so large.
>> > > >>> >
>> > > >>> > Cheers
>> > > >>> > Henry
>> > > >>> >
>> > > >>> >
>> > > >>> > _______________________________________________
>> > > >>> > Xapian-discuss mailing list
>> > > >>> > Xapian-discuss at lists.xapian.org
>> > > >>> > http://lists.xapian.org/mailman/listinfo/xapian-discuss
>> > > >>> >
>> > > >>>
>> > > >>
>> > > >>
>> > > >
>> > >
>> >
>>
>
>
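
To make Rob's point about stop words concrete, here is a toy sketch of how
dropping a few very common words reduces the number of term occurrences an
indexer has to record. It is only a model of the idea, not how Xapian stores
its postings; count_indexed_terms() is a made-up name, and splitting on
whitespace is a simplifying assumption.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>

// Counts the whitespace-separated terms in `text` that are NOT in
// `stop_words`, i.e. the term occurrences that would still be indexed.
std::size_t count_indexed_terms(const std::string& text,
                                const std::set<std::string>& stop_words) {
    std::istringstream in(text);
    std::string term;
    std::size_t kept = 0;
    while (in >> term) {
        if (stop_words.count(term) == 0) {
            ++kept;  // not a stop word, so it would go into the index
        }
    }
    return kept;
}
```

For the text "the cat and the dog sat on a mat" with the stop words 'the',
'and' and 'a', this keeps 5 of the 9 term occurrences. The flip side, as Rob
says, is recall: a query that depends on one of the dropped words can no
longer match on it.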