我有一些Python代码通过数值模拟生成大量数据。该代码将Numpy用于大量计算,并将Pandas用于许多顶级数据。数据集很大,因此代码运行缓慢,现在我正在尝试查看是否可以使用cProfile查找并修复一些热点。
问题在于,cProfile正在将许多热点识别为Pandas,Numpy和/或Python内置程序中的代码段。这是cProfile统计信息,按“ tottime”(函数本身内的总时间)排序。请注意,由于代码本身不归我所有,并且我无权共享详细信息,因此我混淆了项目名称和文件名。
foo.sort_stats('tottime').print_stats(50)
Wed Jun 5 13:18:28 2019 c:\localwork\xxxxxx\profile_data
297514385 function calls (291105230 primitive calls) in 306.898 seconds
Ordered by: internal time
List reduced from 4141 to 50 due to restriction <50>
ncalls tottime percall cumtime percall filename:lineno(function)
281307 31.918 0.000 34.731 0.000 {pandas._libs.lib.infer_dtype}
800 31.443 0.039 31.476 0.039 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\numpy\lib\function_base.py:4703(delete)
109668 23.837 0.000 23.837 0.000 {method 'clear' of 'dict' objects}
153481 19.369 0.000 19.369 0.000 {method 'ravel' of 'numpy.ndarray' objects}
5861614 14.182 0.000 78.492 0.000 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\indexes\base.py:3090(get_value)
5861614 8.891 0.000 8.891 0.000 {method 'get_value' of 'pandas._libs.index.IndexEngine' objects}
5861614 8.376 0.000 99.084 0.000 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\series.py:764(__getitem__)
26840695 7.032 0.000 11.009 0.000 {built-in method builtins.isinstance}
26489324 6.547 0.000 14.410 0.000 {built-in method builtins.getattr}
11846279 6.177 0.000 19.809 0.000 {pandas._libs.lib.values_from_object}
[...]
我是否有一种明智的方法来找出我的代码的哪些部分过度依赖于这些库函数和内置函数?我期望一个答案是“查看累计时间统计信息,这可能表明这些昂贵的电话来自何处”。累积的时间提供了一些洞察力:
foo.sort_stats('cumulative').print_stats(50)
Wed Jun 5 13:18:28 2019 c:\localwork\xxxxxx\profile_data
297514385 function calls (291105230 primitive calls) in 306.898 seconds
Ordered by: cumulative time
List reduced from 4141 to 50 due to restriction <50>
ncalls tottime percall cumtime percall filename:lineno(function)
643/1 0.007 0.000 307.043 307.043 {built-in method builtins.exec}
1 0.000 0.000 307.043 307.043 xxxxxx.py:1(<module>)
1 0.002 0.002 306.014 306.014 xxxxxx.py:264(write_xxx_data)
1 0.187 0.187 305.991 305.991 xxxxxx.py:256(write_yyyy_data)
1 0.077 0.077 305.797 305.797 xxxxxx.py:250(make_zzzzzzz)
1 0.108 0.108 187.845 187.845 xxxxxx.py:224(generate_xyzxyz)
108223 1.977 0.000 142.816 0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\indexing.py:298(_setitem_with_indexer)
1 0.799 0.799 126.733 126.733 xxxxxx.py:63(populate_abcabc_data)
1 0.030 0.030 117.874 117.874 xxxxxx.py:253(<listcomp>)
7201 0.077 0.000 116.612 0.016 C:\LocalWork\xxxxxx\yyyyyyyyyyyy.py:234(xxx_yyyyyy)
108021 0.497 0.000 112.908 0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\indexing.py:182(__setitem__)
5861614 8.376 0.000 99.084 0.000 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\series.py:764(__getitem__)
110024 0.917 0.000 81.210 0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\internals.py:3500(apply)
108021 0.185 0.000 80.685 0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\internals.py:3692(setitem)
5861614 14.182 0.000 78.492 0.000 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\indexes\base.py:3090(get_value)
108021 1.887 0.000 73.064 0.001 C:\LocalWork\WPy-3710\python-3.7.1.amd64\lib\site-packages\pandas\core\internals.py:819(setitem)
[...]
有没有一种确定热点的好方法-比“爬到xxxxxx.py并搜索熊猫可能推断出数据类型并且Numpy可能删除对象的所有位置”更好? / p>