Question

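I have a problem with my Python application, and I think it is related to Python garbage collection, although I am not sure...

The problem is that my application takes a lot of time to exit and to switch from one function to the next.

In my application I handle very large dictionaries containing thousands of large objects, which are instantiated from wrapped C++ classes.

I put some timestamp output in my program, and I saw that at the end of each function, when the objects created inside the function go out of scope, the interpreter spends a lot of time before calling the next function. I observe the same problem at the end of the application, when the program should exit: a lot of time (~hours!) passes between the last timestamp on screen and the appearance of the new prompt.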
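To make the timestamp measurement concrete, here is a minimal sketch (Python 3) that times only the release of a large dictionary. `Payload` and `build_and_drop` are hypothetical names invented for illustration; the real objects in this question are wrapped C++ instances, which may behave quite differently.

```python
import time


class Payload(object):
    """Hypothetical stand-in for a large wrapped C++ object."""

    def __init__(self, n):
        self.data = list(range(n))


def build_and_drop(count, size):
    """Build a big dict, then return how long releasing its contents takes."""
    objects = {i: Payload(size) for i in range(count)}
    t0 = time.perf_counter()
    objects.clear()  # drops the last references to every Payload
    return time.perf_counter() - t0


teardown_seconds = build_and_drop(count=2000, size=100)
print("teardown took %.6f s" % teardown_seconds)
```

Timing the teardown separately from the work helps tell slow deallocation apart from slow computation.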

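Memory usage is stable, so I don't really have a memory leak.

Any suggestions?

Can garbage collection of thousands of large C++ objects be that slow?

Is there a way to speed it up?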

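Update

Thanks a lot for all your answers, you gave me many hints for debugging my code :-)

I use Python 2.6.5 on Scientific Linux 5, a customized distribution based on Red Hat Enterprise 5. Actually, I am not using SWIG to get Python bindings for our C++ code, but the Reflex/PyROOT framework. I know, it is not very well known outside particle physics (but it is still open source and freely available), and I have to use it because it is the default for our main framework.

In this context the del statement on the Python side does not help; I have already tried it. del only deletes the Python variable linked to the C++ object, not the object itself in memory, which is still owned by the C++ side...

...I know, it is non-standard, I guess, and a bit complicated, sorry :-P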
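As a pure-Python analogy for the point about del (this does not use PyROOT; `Wrapper` and `registry` are hypothetical names), the sketch below shows that del removes a name, and that the object survives as long as something else still holds it. It assumes CPython's reference counting.

```python
import weakref


class Wrapper(object):
    """Hypothetical stand-in for a Python proxy of a C++ object."""


registry = []             # plays the role of the "other owner"
obj = Wrapper()
registry.append(obj)      # a second reference, held elsewhere
alive = weakref.ref(obj)  # observe the object without owning it

del obj                   # removes the name only
still_alive_after_del = alive() is not None

registry.clear()          # the last owner lets go
freed_after_release = alive() is None
print(still_alive_after_del, freed_after_release)
```

In the PyROOT case the remaining owner is the C++ side, so nothing done to the Python name alone releases the memory.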

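But following your hints, I'll profile my code and come back to you with more details, as you suggested.

Additional update:

OK, following your suggestions, I instrumented my code with cProfile, and I discovered that the gc.collect() function is actually the one taking most of the running time!!

Here is the output from cProfile + pstats print_stats():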


>>> p.sort_stats("time").print_stats(20)
Wed Oct 20 17:46:02 2010    mainProgram.profile

         547303 function calls (542629 primitive calls) in 548.060 CPU seconds

   Ordered by: internal time
   List reduced from 727 to 20 due to restriction 

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4  345.701   86.425  345.704   86.426 {gc.collect}
        1  167.115  167.115  200.946  200.946 PlotD3PD_v3.2.py:2041(PlotSamplesBranches)
       28   12.817    0.458   13.345    0.477 PlotROOTUtils.py:205(SaveItems)
     9900   10.425    0.001   10.426    0.001 PlotD3PD_v3.2.py:1973(HistoStyle)
     6622    5.188    0.001    5.278    0.001 PlotROOTUtils.py:403(__init__)
       57    0.625    0.011    0.625    0.011 {built-in method load}
      103    0.625    0.006    0.792    0.008 dbutils.py:41(DeadlockWrap)
       14    0.475    0.034    0.475    0.034 {method 'dump' of 'cPickle.Pickler' objects}
     6622    0.453    0.000    5.908    0.001 PlotROOTUtils.py:421(CreateCanvas)
    26455    0.434    0.000    0.508    0.000 /opt/root/lib/ROOT.py:215(__getattr__)
[...]

>>> p.sort_stats("cumulative").print_stats(20)
Wed Oct 20 17:46:02 2010    mainProgram.profile

         547303 function calls (542629 primitive calls) in 548.060 CPU seconds

   Ordered by: cumulative time
   List reduced from 727 to 20 due to restriction 

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001  548.068  548.068 PlotD3PD_v3.2.py:2492(main)
        4    0.000    0.000  346.756   86.689 /usr/lib//lib/python2.5/site-packages/guppy/heapy/Use.py:171(heap)
        4    0.005    0.001  346.752   86.688 /usr/lib//lib/python2.5/site-packages/guppy/heapy/View.py:344(heap)
        1    0.002    0.002  346.147  346.147 PlotD3PD_v3.2.py:2537(LogAndFinalize)
        4  345.701   86.425  345.704   86.426 {gc.collect}
        1  167.115  167.115  200.946  200.946 PlotD3PD_v3.2.py:2041(PlotBranches)
       28   12.817    0.458   13.345    0.477 PlotROOTUtils.py:205(SaveItems)
     9900   10.425    0.001   10.426    0.001 PlotD3PD_v3.2.py:1973(HistoStyle)
    13202    0.336    0.000    6.818    0.001 PlotROOTUtils.py:431(PlottingCanvases)
     6622    0.453    0.000    5.908    0.001 /root/svn_co/rbianchi/SoftwareDevelopment

[...]

>>>

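So, in both outputs, sorted by "time" and by "cumulative" time respectively, gc.collect() is the function consuming most of the running time of my program! :-P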
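For readers who want to reproduce this kind of report, here is a minimal sketch (Python 3) of collecting a profile and printing it sorted both ways. `work` is a hypothetical placeholder for the real main(); the actual program and its profile file are not shown here.

```python
import cProfile
import io
import pstats


def work():
    """Hypothetical workload standing in for the real main()."""
    return sum(i * i for i in range(100000))


profiler = cProfile.Profile()
profiler.enable()
result = work()
profiler.disable()

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("time").print_stats(5)        # top entries by internal time
stats.sort_stats("cumulative").print_stats(5)  # top entries by cumulative time
report = buffer.getvalue()
print(report)
```

Sorting by internal time shows where the time is actually spent, while cumulative time shows which callers lead there.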

And this is the output of the memory profiler Heapy, just before returning from the main() program.

memory usage before return:
Partition of a set of 65901 objects. Total size = 4765572 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  25437  39  1452444  30   1452444  30 str
     1   6622  10   900592  19   2353036  49 dict of PlotROOTUtils.Canvas
     2    109   0   567016  12   2920052  61 dict of module
     3   7312  11   280644   6   3200696  67 tuple
     4   6622  10   238392   5   3439088  72 0xa4ab74c
     5   6622  10   185416   4   3624504  76 PlotROOTUtils.Canvas
     6   2024   3   137632   3   3762136  79 types.CodeType
     7    263   0   129080   3   3891216  82 dict (no owner)
     8    254   0   119024   2   4010240  84 dict of type
     9    254   0   109728   2   4119968  86 type
  Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    10   1917   3   107352   2   4264012  88 function
    11   3647   5   102116   2   4366128  90 ROOT.MethodProxy
    12    148   0    80800   2   4446928  92 dict of class
    13   1109   2    39924   1   4486852  93 __builtin__.wrapper_descriptor
    14    239   0    23136   0   4509988  93 list
    15     87   0    22968   0   4532956  94 dict of guppy.etc.Glue.Interface
    16    644   1    20608   0   4553564  94 types.BuiltinFunctionType
    17    495   1    19800   0   4573364  94 __builtin__.weakref
    18     23   0    11960   0   4585324  95 dict of guppy.etc.Glue.Share
    19    367   1    11744   0   4597068  95 __builtin__.method_descriptor

Any idea why, or how to optimize the garbage collection?

Is there any more detailed check that I can do?

Answer 1

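This is a known garbage collector issue in Python 2.6: garbage collection takes quadratic time when many objects are allocated without any of them being deallocated, e.g. when populating a large list.
There are two simple solutions:

either disable garbage collection before populating large lists and enable it afterwards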

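import gc

l = []
gc.disable()  # pause the cyclic collector while the list grows
for x in xrange(10**6):  # xrange is Python 2; use range on Python 3
    l.append(x)
gc.enable()  # resume normal collection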

or update to Python 2.7, where the issue has been solved

I prefer the second solution, but it is not always an option ;)
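If the first workaround is used in several places, it could be wrapped so the collector is re-enabled even when an exception is raised. This is only a sketch (Python 3); `gc_paused` is a hypothetical helper name, not part of the standard library.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Temporarily disable the cyclic collector, restoring its prior state."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


with gc_paused():
    items = [{"index": i} for i in range(100000)]  # many container objects

print(len(items), gc.isenabled())
```

Reference counting still frees objects while the collector is paused; only cycle detection is postponed.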

Answer 2

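Yes, it could be garbage collection, but it could also be some synchronization with the C++ code, or something completely different (hard to say without code).

Anyway, you should have a look at the SIG for development of Python/C++ integration to find out about such problems and how to speed things up.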

Answer 3

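If your problem really is garbage collection, try explicitly freeing your objects with del as soon as you are done with them.

In general, this doesn't sound like a garbage collection problem, unless we're talking about terabytes of memory.

I agree with S.Lott... profile your app, then bring code snippets and the results back, and we can be much more helpful.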

Can Python garbage collection be that slow?
