[Xapian-discuss] about index speed of xapian

superthread superthread at 126.com
Wed Nov 21 09:46:26 GMT 2012


Hi,
I am using Xapian to index a text file of 268M. I treat each line as a document, and each line has two fields, like 13445511 | 111115151. The file contains 10000000 records. XAPIAN_FLUSH_THRESHOLD is set to 1000000. Indexing the file takes 1026544 ms, which is much slower than Lucene: Lucene indexes about 40000 records per second.
My indexing code:
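    // Fragment: needs <xapian.h>, <fstream>, <iostream> and <string>.
    // fin is assumed to be an already-open std::ifstream; mybase::Timeval
    // is a timer class from elsewhere in the program (not shown).
    try
    {
        Xapian::WritableDatabase database("testindex", Xapian::DB_CREATE_OR_OPEN);
        mybase::Timeval now;
        std::string line;
        while (getline(fin, line))
        {
            // size_type rather than int: npos does not fit in an int
            std::string::size_type pos = line.find('|');
            if (pos != std::string::npos)
            {
                std::string imsi = line.substr(0, pos);
                std::string msisdn = line.substr(pos + 1);
                Xapian::Document doc;  // one document per input line
                doc.add_term(imsi);
                doc.add_term(msisdn);
                database.add_document(doc);
            }
        }
        database.close();  // commits pending changes and closes the database
        std::cout << now.elapsed() << std::endl;
    }
    catch (const Xapian::Error& error)
    {
        std::cout << error.get_msg() << std::endl;
    }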
 
The resulting index directory looks like this:
total 1.9G
-rw-rw-r-- 1 warren warren    0 11-21 17:07 flintlock
-rw-rw-r-- 1 warren warren   28 11-21 17:07 iamchert
-rw-rw-r-- 1 warren warren  22K 11-21 17:24 postlist.baseA
-rw-rw-r-- 1 warren warren  20K 11-21 17:22 postlist.baseB
-rw-rw-r-- 1 warren warren 1.4G 11-21 17:24 postlist.DB
-rw-rw-r-- 1 warren warren 2.0K 11-21 17:24 record.baseA
-rw-rw-r-- 1 warren warren 1.8K 11-21 17:22 record.baseB
-rw-rw-r-- 1 warren warren 121M 11-21 17:24 record.DB
-rw-rw-r-- 1 warren warren 6.7K 11-21 17:24 termlist.baseA
-rw-rw-r-- 1 warren warren 6.1K 11-21 17:22 termlist.baseB
-rw-rw-r-- 1 warren warren 428M 11-21 17:24 termlist.DB
 
That is too big!
 
Is there any problem with my code, and is there any way to improve the indexing speed?
Thank you.

