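I have a DataFrame with n rows. I also have a two-dimensional array of indices. The array also has n rows, but the rows can vary in length. I need to group the DataFrame rows according to the indices and compute the mean of a column.
For example:
If I have a DataFrame df and an array ind, I need to get
[df.loc[i, col_name].mean() for i in ind]
I implemented this using the pandas apply function: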
import pandas as pd
import numpy as np
size = 100000
df = pd.DataFrame(columns=['a'])
df['a'] = np.arange(size)
np.random.seed(1)
ind = np.array([np.random.randint(0, size, size=5) for _ in range(size)])
def group(row):
    # Look up this row's index list and average column 'a' over it
    return df.loc[ind[df.index.get_loc(row.name)], 'a'].mean()
df['avg'] = df.apply(group, axis=1)
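But this is slow and scales poorly. In this case it is much faster to run
df.a.values[ind].mean(axis=1)
However, as far as I can tell, this only works because all elements of ind have the same length, and the following code does not work: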
new_ind = ind.tolist()
new_ind[0].pop()
df.a.values[new_ind].mean(axis=1)
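I played around with the pandas groupby method, but without success. Is there another efficient way to group rows by index lists of unequal length and return the mean of a column?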
Answer 0 (score: 1)
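I think this is what you may be after... I set the size lower to make it easier to demonstrate.
Below is a shortened version of your code with a reproducible (fixed) ind that you can test against: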
import pandas as pd
import numpy as np
size = 10
df = pd.DataFrame(columns=['a'])
df['a'] = np.arange(size)
ind = np.array([[5, 8, 9, 5, 0],
                [0, 1, 7, 6, 9],
                [2, 4, 5, 2, 4],
                [2, 4, 7, 7, 9],
                [1, 7, 0, 6, 9],
                [9, 7, 6, 9, 1],
                [0, 1, 8, 8, 3],
                [9, 8, 7, 3, 6],
                [5, 1, 9, 3, 4],
                [8, 1, 4, 0, 3]])
def group(row):
    return df.loc[ind[df.index.get_loc(row.name)], 'a'].mean()
df['avg'] = df.apply(group, axis=1)
The following also gives the same result:
df['comparison'] = df.a.values[ind].mean(axis=1)
In [86]: (df['comparison'] == df['avg']).all()
Out[86]: True
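The fast expression above relies on ind being rectangular. For index lists of unequal length, the np.bincount statement used in the timing code further down can be sketched on its own. This is a minimal sketch, assuming every index list is non-empty; the ragged list new_ind is built the same way as in the question (one element popped from the first row), and the names sums and expected are only illustrative.

```python
import numpy as np
import pandas as pd

size = 10
df = pd.DataFrame({'a': np.arange(size)})

# Fixed example indices, then make the first row one element shorter
ind = np.array([[5, 8, 9, 5, 0],
                [0, 1, 7, 6, 9],
                [2, 4, 5, 2, 4],
                [2, 4, 7, 7, 9],
                [1, 7, 0, 6, 9],
                [9, 7, 6, 9, 1],
                [0, 1, 8, 8, 3],
                [9, 8, 7, 3, 6],
                [5, 1, 9, 3, 4],
                [8, 1, 4, 0, 3]])
new_ind = ind.tolist()
new_ind[0].pop()

# Length of each index list and the output row each list belongs to
lengths = np.array([len(x) for x in new_ind])
positions = np.arange(len(new_ind))
values = df['a'].to_numpy()

# Flatten all index lists, label each picked value with its output row,
# and let bincount sum the picked values per row
sums = np.bincount(positions.repeat(lengths),
                   weights=values[np.concatenate(new_ind)],
                   minlength=len(new_ind))
avg = sums / lengths
df = df.assign(avg=avg)

# Slow reference result from the question, for comparison
expected = [df.loc[i, 'a'].mean() for i in new_ind]
```

Here np.allclose(avg, expected) should hold, since both compute the per-list mean; the bincount version just avoids a Python-level loop over the rows of df.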
0.5263588428497314
before: 0.014391899108886719
bincount: 0.03328204154968262
To compare scaling, I set up three timeit functions (code at the bottom) and defined the sizes over which I want to test the scaling:
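import timeit
sizes = [10, 100, 1000, 10000]
# mine, bincount and original are the timing functions defined below.
# list(...) materializes the results so they can be plotted under Python 3.
res_mine = list(map(mine, sizes))
res_bincount = list(map(bincount, sizes))
res_original = list(map(original, sizes[:-1]))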
def bincount(size):
    # Best of 10 repeats, each running the bincount statement 100 times
    return min(timeit.repeat(
        """lengths = np.array([len(x) for x in ind])
positions = np.arange(len(ind))
values = df.a.values
avg = np.bincount(positions.repeat(lengths), values[np.concatenate(ind)]) / lengths
df.assign(avg=avg)""",
        """import pandas as pd
import numpy as np
size = {size}
df = pd.DataFrame(columns=['a'])
df['a'] = np.arange(size)
np.random.seed(1)
ind = np.array([np.random.randint(0, size, size=5) for _ in range(size)])
def group(row):
    return df.loc[ind[df.index.get_loc(row.name)], 'a'].mean()""".format(size=size),
        number=100, repeat=10))
def original(size):
    # Best of 10 repeats, each running the apply-based version once
    return min(timeit.repeat(
        """df['avg'] = df.apply(group, axis=1)""",
        """import pandas as pd
import numpy as np
size = {size}
df = pd.DataFrame(columns=['a'])
df['a'] = np.arange(size)
np.random.seed(1)
ind = np.array([np.random.randint(0, size, size=5) for _ in range(size)])
def group(row):
    return df.loc[ind[df.index.get_loc(row.name)], 'a'].mean()""".format(size=size),
        repeat=10, number=1))
def mine(size):
    # Best of 100 repeats, each running the fancy-indexing version 10 times
    return min(timeit.repeat(
        """df['comparison'] = df.a.values[ind].mean(axis=1)""",
        """import pandas as pd
import numpy as np
size = {size}
df = pd.DataFrame(columns=['a'])
df['a'] = np.arange(size)
np.random.seed(1)
ind = np.array([np.random.randint(0, size, size=5) for _ in range(size)])
def group(row):
    return df.loc[ind[df.index.get_loc(row.name)], 'a'].mean()""".format(size=size),
        repeat=100, number=10))
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes()
ax.plot(sizes, res_mine, label='mine')
ax.plot(sizes, res_bincount, label='bincount')
ax.plot(sizes[:-1], res_original, label='original')
plt.yscale('log')
plt.xscale('log')
plt.legend()
plt.xlabel('size of dataframe')
plt.ylabel('run time (s)')
plt.show()
Note that I had to reduce the number of runs for the original, because it takes a very long time.
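Since the question mentions groupby: the same flatten-and-label idea can also be written with pandas groupby. This is a sketch of an alternative that is not part of the benchmark above and has not been timed here. It assumes every index list is non-empty, and ragged_ind is a hypothetical shortened example rather than data from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(10)})

# Hypothetical index lists of unequal length
ragged_ind = [[5, 8, 9, 5], [0, 1, 7, 6, 9], [2, 4]]

# One group id per flattened index, telling which list it came from
lengths = [len(x) for x in ragged_ind]
group_ids = np.repeat(np.arange(len(ragged_ind)), lengths)
flat_ind = np.concatenate(ragged_ind)

# Pick the values positionally, then average them per originating list
picked = pd.Series(df['a'].to_numpy()[flat_ind])
groupby_means = picked.groupby(group_ids).mean()
```

The result is a Series indexed by list position (0, 1, 2 here), so it lines up with the order of ragged_ind.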