I've noticed that calling groupby and apply on a pandas DataFrame is extremely slow in my case (about 100x slower than plain Python). My data is a set of nested lists of varying length but uniform nesting depth, which I convert to a DataFrame by adding columns for the list indices:
import pandas as pd
from random import randint
# original data
data1 = [[[[randint(0, 10) for i in range(randint(1, 3))] for i in range(randint(1, 5))] for i in range(500)] for i in range(3)]
# as a DataFrame
data2 = pd.DataFrame(
    [
        (i1, i2, i3, i4, x4)
        for (i1, x1) in enumerate(data1)
        for (i2, x2) in enumerate(x1)
        for (i3, x3) in enumerate(x2)
        for (i4, x4) in enumerate(x3)
    ],
    columns=['i1', 'i2', 'i3', 'i4', 'x']
)
# with indexing
data3 = data2.set_index(['i1', 'i2', 'i3']).sort_index()
Sample data:
>>> data3
           i4   x
i1 i2  i3
0  0   0    0   8
       0    1   0
       0    2   4
       1    0   4
       2    0   7
       3    0   6
       4    0  10
       4    1   1
       4    2   8
   1   0    0   8
       0    1   9
       0    2   1
       1    0   5
       2    0   9
   2   0    0   1
       1    0   1
       1    1   4
       1    2   0
       2    0   6
       2    1  10
       2    2   8
       3    0   4
       3    1   5
       4    0   3
       4    1   6
   3   0    0   9
       0    1   8
       0    2   7
       1    0   2
       1    1   9
...        ..  ..
2  495 0    0   1
       0    1   6
       0    2   5
       1    0   1
       1    1   8
       1    2   6
   496 0    0   4
       0    1   8
       0    2   3
   497 0    0   3
       0    1  10
       1    0   9
       2    0   6
       2    1   1
       2    2   3
       3    0   0
       4    0  10
   498 0    0   9
       0    1   1
       1    0   2
       1    1  10
       2    0   2
       2    1   2
       2    2   2
       3    0   9
   499 0    0   0
       0    1   2
       1    0   2
       1    1   8
       2    0   6

[8901 rows x 2 columns]
I want to apply a function to the innermost lists. In the example below the function operates on each row separately, but my real code needs the entire group at once, hence the groupby / apply.
%timeit result1 = [[[[i4*x4 for (i4, x4) in enumerate(x3)] for x3 in x2] for x2 in x1] for x1 in data1]
# 100 loops, best of 3: 7.52 ms per loop
%timeit result2 = data2.groupby(['i1', 'i2', 'i3']).apply(lambda group: group['i4']*group['x'])
# 1 loop, best of 3: 4.02 s per loop
%timeit result3 = data3.groupby(level = ['i1', 'i2', 'i3']).apply(lambda group: group['i4']*group['x'])
# 1 loop, best of 3: 8.86 s per loop
The pandas code is orders of magnitude slower than working with the lists directly. Can someone point out what I'm doing wrong? I'm using pandas 0.18.1.
Answer 0 (score: 1)
apply is a last-resort and very slow method, because on every iteration it passes the entire DataFrame group to the custom function. In your particular case you do not need apply at all, since you are just multiplying two columns together; the grouping has no effect here. Try vectorized functions first if you can, then agg and transform when you do need to group. A sketch of the transform route follows.
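If the real per-group computation does need a statistic of the whole group, transform will usually do it without apply. A minimal sketch, assuming (hypothetically, since the real function isn't shown) that you want to centre i4*x on each (i1, i2, i3) group's mean:

# hypothetical whole-group operation: centre the vectorized product i4*x
# on its (i1, i2, i3) group mean; transform broadcasts the group statistic
# back to every row without calling a Python function once per group
tmp = data2.assign(prod=data2['i4'] * data2['x'])
result = tmp['prod'] - tmp.groupby(['i1', 'i2', 'i3'])['prod'].transform('mean')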
Instead of your groupby and apply, you can simply do data2['i4'] * data2['x'].
%timeit result1 = [[[[i4*x4 for (i4, x4) in enumerate(x3)] for x3 in x2] for x2 in x1] for x1 in data1]
# 100 loops, best of 3: 4.51 ms per loop
%timeit result2 = data2.groupby(['i1', 'i2', 'i3']).apply(lambda group: group['i4']*group['x'])
# 1 loop, best of 3: 1.69 s per loop
%timeit result3 = data3.groupby(level = ['i1', 'i2', 'i3']).apply(lambda group: group['i4']*group['x'])
# 1 loop, best of 3: 3.31 s per loop
%timeit data2['i4'] * data2['x']
# 10000 loops, best of 3: 122 µs per loop
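If instead each (i1, i2, i3) group should collapse to a single value, a named reduction through agg also stays in compiled code, unlike a Python lambda in apply (the sum here is only a hypothetical stand-in for the real reduction):

per_group = (
    data2.assign(prod=data2['i4'] * data2['x'])
         .groupby(['i1', 'i2', 'i3'])['prod']
         .agg('sum')  # named reduction, no per-group Python call
)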