[Xapian-tickets] [Xapian] #637: Potential memory leak when assigning MSetItem values

Xapian nobody at xapian.org
Mon Mar 17 22:02:25 GMT 2014


#637: Potential memory leak when assigning MSetItem values
--------------------------------------+----------------------------
 Reporter:  jeffrand                  |             Owner:  richard
     Type:  defect                    |            Status:  new
 Priority:  normal                    |         Milestone:  1.2.x
Component:  Xapian-bindings (Python)  |           Version:  1.2.15
 Severity:  normal                    |        Resolution:
 Keywords:  Memory leak               |        Blocked By:
 Blocking:                            |  Operating System:  Linux
--------------------------------------+----------------------------

Description:

 I've traced a memory leak to a statement which assigns the values from an
 MSetItem to a dictionary which is then appended to a list in Python. We're
 running Python 2.7.3, xapian-core 1.2.15 and xapian-bindings 1.2.15. I've
 provided an example below which reproduces the behavior. The example
 prints the PID and pauses for input at a few points to make observing the
 behavior easier.

 Run the following code and monitor the PID's memory usage in top or a
 similar program. I've observed the resident memory for this example grow
 from 18 MB to 52 MB, even after deleting objects and running garbage
 collection.

 I think the MSetItems are preserved in memory and are not being garbage
 collected correctly, possibly from a lingering reference to the MSet or
 MSetIterator.

 {{{
 #!python
 import os
 import simplejson as json
 import xapian as x
 import shutil
 import gc

 def make_db(path, num_docs=100000):
     try:
         shutil.rmtree(path)
     except OSError, e:
         if e.errno != 2:  # errno 2 == ENOENT: fine if the tree doesn't exist yet
             raise

     db = x.WritableDatabase(path, x.DB_CREATE)
     for i in xrange(1, num_docs):
         doc = x.Document()
         doc.set_data(json.dumps({ 'id': i, 'enabled': True }))
         doc.add_term('XTYPA')
         db.add_document(doc)
     return db

 def run_query(db, num_docs=100000):
     e = x.Enquire(db)
     e.set_query(x.Query('XTYPA'))
     m = e.get_mset(0, num_docs, True, None)

     # Store the MSetItem's data, which causes a memory leak
     data = []
     for i in m:
         data.append({ 'data': i.document.get_data(), 'id': i.docid, })

     # Make sure I'm not crazy
     del num_docs, db, i, e, m, data
     gc.collect()

 def main():
     # print the PID to monitor
     print 'PID to monitor: {}'.format(os.getpid())

     db = make_db('/tmp/test.db')
     raw_input("database is done, ready?")

     run_query(db, 100000)
     raw_input('done?')

 if __name__ == '__main__':
     main()
 }}}

--

Comment (by olly):

 If you ask the Python gc module how many objects are allocated, the count
 doesn't increase.  The attached, slightly modified version of your script
 shows this (note that calling {{{gc.collect()}}} more than once sometimes
 seems to be necessary to actually collect all objects - not sure why).
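
 The attachment isn't reproduced here, but a minimal sketch of such a
 check (an assumption on my part: counting live objects with
 {{{gc.get_objects()}}} around the {{{run_query()}}} call, with repeated
 collection) might look like:

 {{{
 #!python
 import gc

 def count_objects(max_passes=10):
     # Collect repeatedly until the number of tracked objects stops
     # shrinking; a single gc.collect() doesn't always free everything.
     prev = None
     for _ in range(max_passes):
         gc.collect()
         cur = len(gc.get_objects())
         if cur == prev:
             break
         prev = cur
     return cur

 # Slots into main() from the script above, around the run_query() call.
 print 'num objects before = ', count_objects()
 run_query(db, 100000)
 print 'num objects after = ', count_objects()
 }}}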

 On trunk:

 {{{
 $ ./run-python-test ticket637.py
 PID to monitor: 4107
 database is done, ready?
 num objects before =  7519
 num objects after =  7519
 done?
 $
 }}}

 And on HEAD of the 1.2 branch:

 {{{
 $ PYTHONPATH=. python ticket637.py
 PID to monitor: 972
 database is done, ready?
 num objects before =  7115
 num objects after =  7115
 done?
 }}}

 So I don't see how this can be Python hanging on to objects.

 I think this is just due to C++'s allocator hanging on to memory.  As I
 said in my reply to the mailing list, this memory should just get reused
 by later operations (like the next query you run).
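
 One way to see the reuse (a Linux-only sketch, not from the ticket;
 {{{rss_kb()}}} is a hypothetical helper reading {{{/proc/self/status}}})
 is to run the same query twice - if the allocator is just caching freed
 memory, the second run shouldn't grow the resident size much further:

 {{{
 #!python
 def rss_kb():
     # Resident set size of this process in kB (Linux only).
     with open('/proc/self/status') as f:
         for line in f:
             if line.startswith('VmRSS:'):
                 return int(line.split()[1])

 print 'rss at start: {} kB'.format(rss_kb())
 run_query(db, 100000)   # first run: heap grows as the MSet is built
 print 'rss after 1st run: {} kB'.format(rss_kb())
 run_query(db, 100000)   # second run: freed memory should be reused
 print 'rss after 2nd run: {} kB'.format(rss_kb())
 }}}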

--
Ticket URL: <http://trac.xapian.org/ticket/637#comment:1>
Xapian <http://xapian.org/>