[Xapian-devel] Reducing Xapian memory usage

Vishesh Handa me at vhanda.in
Fri Mar 28 17:03:18 GMT 2014


Hey guys

I noticed Xapian using a lot of memory while indexing [1], so I decided to look 
at the bottlenecks and where this could be improved.

Here are the largest memory consumers that I noticed (Chert backend) -

1. Every document has a map<string, OmDocumentTerm>, and OmDocumentTerm contains 
the same string again. This results in every term being stored in memory 
twice. Additionally, multiple documents may have the same terms, and each of 
them has its own copy of the string, even though the term is the same.

2. Spelling db - it too allocates its own std::string copies

3. database mod_plist - ditto

4. When fetching the terms for a document, the entire term list is loaded in 
one go. This causes a huge block of memory to be allocated. Depending on the 
number of terms in a document, it can get quite bad. We might want to do this 
in smaller chunks.

--

The main way that I think this can be fixed is by introducing a new string 
class which uses reference counting. That way (1) will easily be solved.
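
To make the idea concrete, here is a minimal sketch of what such a class could 
look like. RcString and TermEntry are hypothetical names, not existing Xapian 
classes; TermEntry only stands in for OmDocumentTerm. The sketch assumes the 
string is immutable once created, so a shared_ptr to a const std::string is 
enough.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical reference-counted, immutable string (not a Xapian class).
// Copying an RcString copies a pointer and bumps a counter; the character
// data itself is shared between all copies.
class RcString {
  public:
    explicit RcString(std::string s)
        : data_(std::make_shared<const std::string>(std::move(s))) {}

    const std::string& str() const { return *data_; }

    // Number of RcString objects currently sharing this buffer.
    long use_count() const { return data_.use_count(); }

    // True if both objects point at the very same buffer.
    bool shares_buffer_with(const RcString& other) const {
        return data_ == other.data_;
    }

    // Ordering by string value, so RcString can be used as a map key.
    friend bool operator<(const RcString& a, const RcString& b) {
        return *a.data_ < *b.data_;
    }

  private:
    std::shared_ptr<const std::string> data_;
};

// Stand-in for OmDocumentTerm: it stores the term name again, as in (1).
struct TermEntry {
    RcString tname;
    unsigned wdf;
};

// Stand-in for the per-document map<string, OmDocumentTerm>.
using TermMap = std::map<RcString, TermEntry>;
```

With this, inserting a term as both the map key and the value's tname keeps 
a single copy of the characters in memory; only the reference count goes up.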

Regarding (2) and (3), it probably makes sense to have a set<XapianString> in 
the Database class. When spellings, documents, or position lists are loaded 
into memory, the string is fetched from that set. This way there will only be 
one copy of each string in memory.
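
A rough sketch of such a set, purely as an illustration: TermPool and intern() 
are hypothetical names, a shared_ptr to a const std::string stands in for 
XapianString, and the transparent comparator assumes C++14 or later. Note that 
in this sketch the pool never drops an entry, so a real version would need 
some policy for removing strings that are no longer used.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <string>

// Hypothetical string pool owned by the database object (not Xapian API).
class TermPool {
  public:
    using Handle = std::shared_ptr<const std::string>;

    // Return the pooled copy of `term`, adding it on first use.
    Handle intern(const std::string& term) {
        auto it = pool_.find(term);  // lookup without allocating a Handle
        if (it == pool_.end())
            it = pool_.insert(std::make_shared<const std::string>(term)).first;
        return *it;
    }

    // Number of distinct strings currently held.
    std::size_t size() const { return pool_.size(); }

  private:
    // Compare handles by string value; is_transparent enables lookup by
    // plain std::string (C++14 heterogeneous lookup).
    struct ByValue {
        using is_transparent = void;
        bool operator()(const Handle& a, const Handle& b) const {
            return *a < *b;
        }
        bool operator()(const Handle& a, const std::string& b) const {
            return *a < b;
        }
        bool operator()(const std::string& a, const Handle& b) const {
            return a < *b;
        }
    };

    std::set<Handle, ByValue> pool_;
};
```

Two calls to intern() with equal strings return handles to the same buffer, 
so the spelling code, the documents and mod_plist could all hold handles 
instead of separate std::string copies.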

I'm not too sure how this would relate to the public API though. Opinions?

-- 
Vishesh Handa

[1] http://lists.xapian.org/pipermail/xapian-discuss/2014-March/009079.html


