Question

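I have some Python code that generates a large amount of data through numerical simulation. The code uses Numpy for much of the computation and Pandas for much of the top-level data. The data sets are large, so the code runs slowly, and I am now trying to see whether I can use cProfile to find and fix some hot spots.

The problem is that cProfile identifies many of the hot spots as pieces of code inside Pandas, Numpy, and/or Python built-ins. Here are the cProfile statistics, sorted by "tottime" (the total time spent in the function itself). Note that I have obfuscated the project name and file names, because the code is not mine and I do not have permission to share details.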

foo.sort_stats('tottime').print_stats(50)
Wed Jun  5 13:18:28 2019    c:\localwork\xxxxxx\profile_data

         297514385 function calls (291105230 primitive calls) in 306.898 seconds

   Ordered by: internal time
   List reduced from 4141 to 50 due to restriction <50>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   281307   31.918    0.000   34.731    0.000 {pandas._libs.lib.infer_dtype}
      800   31.443    0.039   31.476    0.039 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\numpy\lib\function_base.py:4703(delete)
   109668   23.837    0.000   23.837    0.000 {method 'clear' of 'dict' objects}
   153481   19.369    0.000   19.369    0.000 {method 'ravel' of 'numpy.ndarray' objects}
  5861614   14.182    0.000   78.492    0.000 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\indexes\base.py:3090(get_value)
  5861614    8.891    0.000    8.891    0.000 {method 'get_value' of 'pandas._libs.index.IndexEngine' objects}
  5861614    8.376    0.000   99.084    0.000 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\series.py:764(__getitem__)
 26840695    7.032    0.000   11.009    0.000 {built-in method builtins.isinstance}
 26489324    6.547    0.000   14.410    0.000 {built-in method builtins.getattr}
 11846279    6.177    0.000   19.809    0.000 {pandas._libs.lib.values_from_object}
[...]
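
For context, the listing above presumably comes from a `pstats.Stats` object (`foo`) loaded from a saved profile file. A minimal sketch of that setup follows; the `simulate` function and the temporary file location are hypothetical stand-ins, not the actual project code.

```python
import cProfile
import io
import os
import pstats
import tempfile


def simulate(n):
    # Hypothetical stand-in for the real numerical simulation.
    return sum(i * i for i in range(n))


# Profile the workload.
profiler = cProfile.Profile()
profiler.runcall(simulate, 100_000)

# Collect the report in a buffer instead of printing straight to stdout.
buffer = io.StringIO()

with tempfile.TemporaryDirectory() as tmp:
    # Save the raw statistics to a file, then load them back,
    # mirroring the "profile_data" file named in the listing.
    path = os.path.join(tmp, 'profile_data')
    profiler.dump_stats(path)
    foo = pstats.Stats(path, stream=buffer)

# Sort by time spent inside each function itself and show the top 50.
foo.sort_stats('tottime').print_stats(50)
report = buffer.getvalue()
print(report)
```

Sorting by `'tottime'` ranks functions by the time spent in their own bodies, which is why low-level library routines and built-ins tend to float to the top.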

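Is there a sensible way to figure out which parts of my code lean too heavily on these library functions and built-ins? I expect one answer to be "look at the cumulative time statistics, which may indicate where these expensive calls originate". The cumulative times do give some insight: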

foo.sort_stats('cumulative').print_stats(50)
Wed Jun  5 13:18:28 2019    c:\localwork\xxxxxx\profile_data

         297514385 function calls (291105230 primitive calls) in 306.898 seconds

   Ordered by: cumulative time
   List reduced from 4141 to 50 due to restriction <50>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    643/1    0.007    0.000  307.043  307.043 {built-in method builtins.exec}
        1    0.000    0.000  307.043  307.043 xxxxxx.py:1(<module>)
        1    0.002    0.002  306.014  306.014 xxxxxx.py:264(write_xxx_data)
        1    0.187    0.187  305.991  305.991 xxxxxx.py:256(write_yyyy_data)
        1    0.077    0.077  305.797  305.797 xxxxxx.py:250(make_zzzzzzz)
        1    0.108    0.108  187.845  187.845 xxxxxx.py:224(generate_xyzxyz)
   108223    1.977    0.000  142.816    0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\indexing.py:298(_setitem_with_indexer)
        1    0.799    0.799  126.733  126.733 xxxxxx.py:63(populate_abcabc_data)
        1    0.030    0.030  117.874  117.874 xxxxxx.py:253(<listcomp>)
     7201    0.077    0.000  116.612    0.016 C:\LocalWork\xxxxxx\yyyyyyyyyyyy.py:234(xxx_yyyyyy)
   108021    0.497    0.000  112.908    0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\indexing.py:182(__setitem__)
  5861614    8.376    0.000   99.084    0.000 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\series.py:764(__getitem__)
   110024    0.917    0.000   81.210    0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\internals.py:3500(apply)
   108021    0.185    0.000   80.685    0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\internals.py:3692(setitem)
  5861614   14.182    0.000   78.492    0.000 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\indexes\base.py:3090(get_value)
   108021    1.887    0.000   73.064    0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\internals.py:819(setitem)
[...]

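Is there a good way to pin down the hot spots, something better than "crawl through xxxxxx.py and search for every place where Pandas might be inferring a data type and Numpy might be deleting objects"?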
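
One standard-library facility that bears on this question is `pstats.Stats.print_callers`, which lists the immediate callers of every function whose name matches a pattern. The sketch below is only an illustration under assumed names: `hot_helper`, `caller_a`, `caller_b`, and `main` are hypothetical, with `hot_helper` playing the role of an expensive library routine such as `infer_dtype`.

```python
import cProfile
import io
import pstats


def hot_helper(values):
    # Hypothetical stand-in for an expensive library call.
    return sorted(values)


def caller_a():
    # Calls the expensive helper many times.
    for _ in range(200):
        hot_helper(range(500, 0, -1))


def caller_b():
    # Calls the expensive helper only once.
    hot_helper(range(10))


def main():
    caller_a()
    caller_b()


profiler = cProfile.Profile()
profiler.runcall(main)

# Write the report to a buffer so it can be examined afterwards.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)

# The string argument is a regular expression matched against the
# "filename:lineno(function)" label of each profiled function.
stats.sort_stats('tottime').print_callers('hot_helper')
callers_report = buffer.getvalue()
print(callers_report)
```

Applied to the profile in this question, the analogous call would presumably be `foo.print_callers('infer_dtype')`. The report shows only direct callers, so reaching a line in xxxxxx.py may require repeating the call on each caller in turn.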

How do I extract useful information from cProfile when using Pandas and Numpy?

0 answers: