[Xapian-devel] Reducing Xapian memory usage
Vishesh Handa
me at vhanda.in
Fri Mar 28 17:03:18 GMT 2014
Hey guys
I noticed Xapian using a lot of memory while indexing [1], so I decided to
look at the bottlenecks and where this can be improved.
Here are the largest hot spots I noticed (Chert backend):
1. Every document has a map<string, OmDocumentTerm>, and each OmDocumentTerm
contains the same string again. As a result, every term is stored in memory
twice. Additionally, multiple documents may share terms, and each document
keeps its own copy of the string even when the term is identical.
2. Spelling db - it too allocates its own std::string copies
3. Database mod_plist - ditto
4. When fetching the terms for a document, the entire term list is loaded in
one go. This pulls a huge block of memory in at once, and depending on the
number of terms in the document it can get quite bad. We might want to load
the list in smaller chunks instead.
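
To make (1) concrete, here is a minimal sketch of the duplication. It is a
simplified stand-in, not Xapian's actual code: DocumentTerm, its fields,
add_term() and term_bytes() are hypothetical names used only for illustration.
It counts the character data held per document when the term is stored both
as the map key and inside the value. Real heap usage will differ because of
allocator overhead and any small-string optimisation in std::string.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for the per-term entry described in (1).
struct DocumentTerm {
    std::string term;   // the term text, stored again inside the value
    unsigned wdf = 0;   // within-document frequency
};

using TermMap = std::map<std::string, DocumentTerm>;

// Add one occurrence of `term` to a document's term map.
void add_term(TermMap& terms, const std::string& term) {
    DocumentTerm& entry = terms[term];   // copy #1: the map key
    if (entry.wdf == 0) {
        entry.term = term;               // copy #2: inside the value
    }
    ++entry.wdf;
}

// Characters of term text held by the map, counting key and value copies.
std::size_t term_bytes(const TermMap& terms) {
    std::size_t total = 0;
    for (const auto& kv : terms) {
        total += kv.first.size() + kv.second.term.size();
    }
    return total;
}

// Small usage example: two distinct terms, each held twice.
void example_duplication() {
    TermMap terms;
    add_term(terms, "memory");
    add_term(terms, "memory");
    add_term(terms, "index");
    assert(terms.size() == 2);
    assert(term_bytes(terms) == 2 * (6 + 5));
}
```

With N documents that share the same vocabulary, this cost is paid N times,
because nothing is shared between the per-document maps.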
--
The main way I think this can be fixed is by introducing a new string class
which uses reference counting. That would solve (1) directly.
Regarding (2) and (3), it probably makes sense to have a set<XapianString> in
the Database class. When spellings, documents and position lists are loaded
into memory, the string is fetched from that set, so that each distinct
string has only one copy in memory.
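
A rough sketch of the idea, under some assumptions: instead of a new string
class it uses std::shared_ptr<const std::string> for the reference counting,
and TermRef, TermLess and TermPool are hypothetical names, not existing Xapian
classes. The lookup by std::string relies on transparent comparators, so it
needs C++14 or later. In this version the pool itself holds a reference, so a
string is only freed once the pool is destroyed.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <string>

// Hypothetical handle: a shared, immutable, reference-counted term.
using TermRef = std::shared_ptr<const std::string>;

// Orders handles by string value and allows lookup by plain std::string
// without allocating a temporary handle.
struct TermLess {
    using is_transparent = void;
    bool operator()(const TermRef& a, const TermRef& b) const { return *a < *b; }
    bool operator()(const TermRef& a, const std::string& b) const { return *a < b; }
    bool operator()(const std::string& a, const TermRef& b) const { return a < *b; }
};

// Hypothetical pool that would live in the Database object.
class TermPool {
  public:
    // Return the single shared copy of `term`, creating it on first use.
    TermRef intern(const std::string& term) {
        auto it = pool_.find(term);
        if (it != pool_.end()) {
            return *it;
        }
        return *pool_.insert(std::make_shared<const std::string>(term)).first;
    }

    std::size_t size() const { return pool_.size(); }

  private:
    std::set<TermRef, TermLess> pool_;
};

// Small usage example: equal terms resolve to the same underlying string.
void example_interning() {
    TermPool pool;
    TermRef a = pool.intern("xapian");
    TermRef b = pool.intern(std::string("xap") + "ian");
    assert(a.get() == b.get());   // same storage, not just equal text
    assert(pool.size() == 1);
}
```

Documents, the spelling table and the modified posting lists would then hold
a TermRef rather than their own std::string, at the cost of one pointer plus
the reference-count bookkeeping per use.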
I'm not too sure how this would relate to the public API though. Opinions?
--
Vishesh Handa
[1] http://lists.xapian.org/pipermail/xapian-discuss/2014-March/009079.html