[Xapian-discuss] Size of the index

Justine Demeyer justine.demeyer at gmail.com
Tue Nov 25 10:32:29 GMT 2008


Here is the code that builds the index:

void Index(char* ind, char* directory)
{
    try
    {
        timeval tim;
        double t1, t2, dif;

        string index(ind);

        // Start time of the operation
        gettimeofday(&tim, NULL);
        t1 = tim.tv_sec + (tim.tv_usec / 1000000.0);

        // Create or open the index
        Xapian::WritableDatabase db(ind, Xapian::DB_CREATE_OR_OPEN);
        Xapian::TermGenerator indexer;
        Xapian::Stem stemmer("english");
        indexer.set_stemmer(stemmer);

        struct dirent *lecture;
        DIR *rep;

        rep = opendir(directory);
        if (rep == NULL) // nothing to index if the directory cannot be opened
        {
            cout << "cannot open " << directory << endl;
            return;
        }
        while ((lecture = readdir(rep)))
        {
            char* name = lecture->d_name;
            std::string name2(name);

            // no separator is added here, so directory must end with '/'
            string path = directory + name2;

            ifstream fichier(path.c_str(), ios::in);

            if (fichier) // this test fails if the file could not be opened
            {
                string ligne;   // holds each line as it is read
                string contenu; // accumulates the whole file

                // this loop stops at end of file or as soon as a read error occurs
                while (std::getline(fichier, ligne))
                {
                    contenu = contenu + ligne + "\n";
                }

                // Indexing: the full text is stored as document data and also indexed
                Xapian::Document doc;
                doc.set_data(contenu);

                indexer.set_document(doc);
                indexer.index_text(contenu);

                db.add_document(doc);
                cout << "add " << path.c_str() << endl;
            }
        }
        // Commit the pending changes
        cout << "Optimizing" << endl;
        db.flush();
        closedir(rep);

        // End time of the operation
        gettimeofday(&tim, NULL);
        t2 = tim.tv_sec + (tim.tv_usec / 1000000.0);

        // Compute the duration of the operation
        dif = t2 - t1;
        Calculate(dif);
    }
    catch (const Xapian::Error &e)
    {
        cout << e.get_description() << endl;
    }
}

Thanks for your help.


2008/11/25 Henry <henka at cityweb.co.za>

> Quoting "Justine Demeyer" <justine.demeyer at gmail.com>:
> > I have a question about the size of the Xapian index.
> >
> > I indexed a set of 200,000 documents with a total size of about 1 GB, and
> > the index created is more than 3 GB! What can explain this
> > difference?
>
> You'll find this with all indexing systems, to some degree.  The size
> of your index is almost always larger than the raw text, depending on
> how you've structured the index/terms, whether you're removing stop
> words, etc.  It also depends on whether you've compacted the DB.
>
> If you post more detail about your index then that will help to
> pinpoint why your index is so large.
>
> Cheers
> Henry
>
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>
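
To make the factors in the quoted reply concrete, here is a rough sketch of a toy inverted index that only counts what it would have to store. It is not Xapian's on-disk format, and the names IndexStats and estimate are hypothetical, made up for this illustration. It assumes whitespace-separated words, no stemming, and a caller-supplied stop word list.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Toy model only: counts what an index would need to keep.
// This is NOT how Xapian lays out its data on disk.
struct IndexStats {
    std::size_t raw_bytes = 0;    // bytes of source text seen
    std::size_t stored_bytes = 0; // bytes kept as a copy of the document
    std::size_t postings = 0;     // distinct (term, document) pairs
    std::size_t positions = 0;    // one entry per indexed word occurrence
};

// Hypothetical helper: walks each document and tallies the entries
// a simple index would hold under the given choices.
IndexStats estimate(const std::vector<std::string>& docs,
                    const std::set<std::string>& stopwords,
                    bool store_text, bool keep_positions)
{
    IndexStats s;
    for (const std::string& text : docs) {
        s.raw_bytes += text.size();
        if (store_text)
            s.stored_bytes += text.size(); // full copy of the text

        std::set<std::string> seen; // terms already counted for this document
        std::istringstream in(text);
        std::string word;
        while (in >> word) {
            if (stopwords.count(word))
                continue; // stop words are not indexed at all
            if (seen.insert(word).second)
                ++s.postings; // first occurrence in this document
            if (keep_positions)
                ++s.positions; // every occurrence records a position
        }
    }
    return s;
}
```

In this model, storing the text adds a full copy of the input on top of the term data, and positional information grows with every word occurrence rather than with every distinct term. The posted code does both: doc.set_data(contenu) keeps the whole file as document data, and index_text() records positions. Compaction is a separate factor that this sketch does not model.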

