I'm currently porting a Java/Hibernate system to Python, holding the data it works on in memory in a Pandas DataFrame. The code currently runs far too slowly, and profiling shows that this function is the bottleneck:
def find_names_in_explode(id, normalized_name, data):
    exploded_names = utils.explode_to_matchable_names(normalized_name)
    # This is a trick to use binary search on a dataframe.
    # I found out about it from https://www.youtube.com/watch?v=R2LiVJLGAHE.
    # Make sure the full data frame is sorted before passing in.
    start_and_end_strings = [(name[0:-1] + chr(ord(name[-1]) - 1),
                              name[0:-1] + chr(ord(name[-1]) + 1))
                             for name in exploded_names if name]
    all_chunks = []
    for start_and_end in start_and_end_strings:
        start_index, end_index = data['name'].searchsorted(start_and_end)
        if start_index > 0 or end_index < data['name'].size:
            # searchsorted will return the whole data frame if it doesn't
            # find any matches; we're assuming that's never what we want.
            all_chunks.append(data.iloc[start_index:end_index])
    all_rows = pd.concat(all_chunks)
    return all_rows[(all_rows['id'] != id) & (~all_rows['id'].duplicated())]
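To make the searchsorted trick concrete, here is a minimal standalone example of the same bracketing idea on a toy sorted Series (the names and values here are invented purely for illustration):

    import pandas as pd

    # A sorted name column, as the trick requires.
    names = pd.Series(['john a smith', 'john smith', 'john smith md', 'zoe jones'])

    # To locate 'john smith', bracket it with strings whose last character is
    # one code point lower and one higher, then binary-search for both bounds.
    target = 'john smith'
    lo = target[:-1] + chr(ord(target[-1]) - 1)   # 'john smitg'
    hi = target[:-1] + chr(ord(target[-1]) + 1)   # 'john smiti'
    start, end = names.searchsorted([lo, hi])

    print(names.iloc[start:end])   # 'john smith' and 'john smith md'

The slice covers everything that sorts between the two bracket strings, so it returns the exact name plus longer names that share it as a prefix.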
This function is run for every row of the main DataFrame. Each row of the main DataFrame has a name and an id (plus other columns that aren't relevant here). For each row, the function generates a set of related names (for example, if the input name is "John R. Smith, MD", the set will contain "John Smith", "John R. Smith", "John Smith MD", and so on), finds all rows of the DataFrame whose name matches one of the names in that set, compiles those results into a new DataFrame, and hands them off for further processing. An earlier, simpler version using isin was too slow, so I tried the trick from the video linked in the code comments to do a binary search instead of a linear scan. That made things faster, but it's still not fast enough.
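For comparison, the earlier isin version was essentially the following (a rough sketch of the idea rather than the exact code; the function name is just illustrative):

    def find_names_with_isin(id, normalized_name, data):
        # Linear scan: test every row's name against the exploded set of names.
        exploded_names = utils.explode_to_matchable_names(normalized_name)
        all_rows = data[data['name'].isin(exploded_names)]
        return all_rows[(all_rows['id'] != id) & (~all_rows['id'].duplicated())]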
This function is the inner operation of a quadratic loop over the rows of the main DataFrame, and I can't figure out how to vectorize it. I profiled it on a DataFrame with one million rows, but the eventual goal is to run the system on a DataFrame with roughly 200 million rows (about 60 GB of data), so it needs to be very fast. Here is part of the profiling output from the one-million-row test:
1877294411 function calls (1842146549 primitive calls) in 7907.774 seconds
Ordered by: cumulative time
List reduced from 3253 to 30 due to restriction <30>
ncalls tottime percall cumtime percall filename:lineno(function)
3 0.352 0.117 7909.577 2636.526 main.py:1(<module>)
50516/1 23.000 0.000 7909.574 7909.574 {built-in method builtins.exec}
1 8.653 8.653 7908.866 7908.866 main.py:40(main)
660206 9.805 0.000 7761.765 0.012 name_matchers.py:100(match)
651044 18.768 0.000 7549.588 0.012 name_matchers.py:80(find_names_in_explode)
1249038/998908 3.316 0.000 3754.752 0.004 indexing.py:1317(__getitem__)
2050871 8.807 0.000 3730.645 0.002 internals.py:2779(__init__)
1210664 2.580 0.000 3722.063 0.003 indexing.py:1720(_getitem_axis)
960534 1.908 0.000 3699.325 0.004 indexing.py:1689(_get_slice_axis)
710404 0.744 0.000 3688.891 0.005 indexing.py:141(_slice)
710404 3.949 0.000 3688.148 0.005 generic.py:1742(_slice)
2050871 34.184 0.000 3679.895 0.002 internals.py:2876(_rebuild_blknos_and_blklocs)
710404 5.405 0.000 3674.365 0.005 internals.py:3384(get_slice)
4147176 3598.437 0.001 3598.437 0.001 {method 'fill' of 'numpy.ndarray' objects}
689418 7.679 0.000 2698.926 0.004 ops.py:809(wrapper)
689418 2557.017 0.004 2598.529 0.004 ops.py:755(na_op)
7787062 45.116 0.000 366.018 0.000 series.py:139(__init__)
651044 2.530 0.000 338.349 0.001 concat.py:21(concat)
2740288 11.580 0.000 327.501 0.000 frame.py:1940(__getitem__)
651046 3.578 0.000 255.852 0.000 frame.py:1983(_getitem_array)
689418 6.704 0.000 247.067 0.000 ops.py:909(wrapper)
689420 6.476 0.000 241.191 0.000 generic.py:1909(take)
651044 5.944 0.000 221.647 0.000 concat.py:356(get_result)
651044 2.388 0.000 205.428 0.000 internals.py:4814(concatenate_block_managers)
689421 7.032 0.000 202.054 0.000 internals.py:3990(take)
2089239 3.209 0.000 199.700 0.000 _decorators.py:65(wrapper)
1378836 4.680 0.000 198.079 0.000 ops.py:913(<lambda>)
651044 1.358 0.000 170.559 0.000 name_matchers.py:59(name_match)
689421 3.174 0.000 156.992 0.000 internals.py:3860(reindex_indexer)
4089923 18.004 0.000 134.581 0.000 series.py:2894(_sanitize_array)
The function match calls find_names_in_explode, and as you can see, most of its cumulative time is spent there. Is there some way to make better use of Pandas or NumPy to speed this up?