Compute the mean of groups of DataFrame rows given by a 2D list of indices of unequal length

Time: 2017-08-10 23:00:06

Tags: python pandas numpy

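I have a DataFrame with n rows. I also have a 2D array of indices. This array also has n rows, but the length of each row can vary. I need to group the DataFrame rows according to the indices and compute the mean of a column.

For example:

If I have a DataFrame df and an array ind, I need to get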

[df.loc[idx, col_name].mean() for idx in ind]
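
To make the goal concrete, here is a minimal sketch on a small made-up frame with index lists of unequal length. The names `df_small` and `ind_small` and the values are illustrative assumptions, not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column 'a' holds 0, 10, 20, 30.
df_small = pd.DataFrame({'a': [0, 10, 20, 30]})

# One list of row labels per DataFrame row; the lengths differ.
ind_small = [[0, 1], [1, 2, 3], [3], [0, 0, 2, 3]]

# Reference (slow) result: the mean of 'a' over each index list.
expected = [df_small.loc[idx, 'a'].mean() for idx in ind_small]
print(expected)
```

Each entry of `expected` is the mean of column `a` over the rows named in the corresponding index list; repeated indices count once per occurrence.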

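I implemented this using the pandas apply function:

import pandas as pd
import numpy as np

size = 100000
df = pd.DataFrame(columns=['a'])
df['a'] = np.arange(size)
np.random.seed(1)
# One row of 5 random row indices per DataFrame row
ind = np.array([np.random.randint(0, size, size=5) for _ in range(size)])
def group(row):
    # Look up this row's position, then average 'a' over its index list
    return df.loc[ind[df.index.get_loc(row.name)], 'a'].mean()
df['avg'] = df.apply(group, axis=1)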

But this is slow and scales very poorly. In this case, the following is much faster:

df.a.values[ind].mean(axis=1)

However, as far as I can tell, this only works because all elements of ind have the same length, and the following code does not work:

new_ind = ind.tolist()
new_ind[0].pop()
df.a.values[new_ind].mean(axis=1)

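I experimented with the pandas groupby method, but without success. Is there another efficient way to group rows according to index lists of unequal length and return the mean of a column?

1 answer:

Answer 0: (score: 1)

I think this is what you might be after... I set the size lower to make it easier to demonstrate.

Below is a shortened version of your code with a reproducible (fixed) ind that you can test against.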

import pandas as pd
import numpy as np
size = 10
df = pd.DataFrame(columns=['a'])
df['a'] = np.arange(size)
ind = np.array([[5, 8, 9, 5, 0],
       [0, 1, 7, 6, 9],
       [2, 4, 5, 2, 4],
       [2, 4, 7, 7, 9],
       [1, 7, 0, 6, 9],
       [9, 7, 6, 9, 1],
       [0, 1, 8, 8, 3],
       [9, 8, 7, 3, 6],
       [5, 1, 9, 3, 4],
       [8, 1, 4, 0, 3]])
def group(row):
    return df.loc[ind[df.index.get_loc(row.name)], 'a'].mean()
df['avg'] = df.apply(group, axis=1)

The following gives the same result:

df['comparison'] = df.a.values[ind].mean(axis=1)

In [86]: (df['comparison'] == df['avg']).all()
Out[86]: True

Timings

  • Before: 0.5263588428497314
  • After: 0.014391899108886719
  • Using bincount: 0.03328204154968262
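
The bincount timing refers to the statement used in the timing code further down. As a sketch of how that same statement handles index lists of unequal length, the example below flattens the lists, tags every flattened index with the number of the row it belongs to, and lets `np.bincount` sum per row. The ragged `ind` here is made up for illustration, and the code assumes every list is non-empty and that the DataFrame has a default integer index, so labels and positions coincide.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(10)})

# Illustrative ragged index lists: one per row of df, lengths vary.
ind = [[5, 8, 9, 5], [0, 1, 7, 6, 9], [2, 4], [2], [1, 7, 0],
       [9, 7, 6, 9, 1], [0, 1, 8], [9, 8], [5, 1, 9, 3, 4], [8, 1, 4, 0, 3]]

lengths = np.array([len(x) for x in ind])
# Row number for every flattened index: 0,0,0,0,1,1,1,1,1,2,2,...
group_ids = np.arange(len(ind)).repeat(lengths)
flat = np.concatenate(ind)
values = df['a'].values  # positional lookup (assumes a default RangeIndex)

# Sum the gathered values per row, then divide by the list lengths.
sums = np.bincount(group_ids, weights=values[flat], minlength=len(ind))
avg = sums / lengths
df['avg'] = avg
```

An empty index list would give a length of zero and therefore a division by zero, so such rows would need separate handling.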

Comparison and scaling

[Figure: log-log plot of run time (s) against size of dataframe for the 'mine', 'bincount', and 'original' timings]

To compare scaling I set up three timeit functions (code at the bottom), and I define the sizes at which I want to test the scaling:

import timeit
sizes = [10, 100, 1000, 10000]
# list(...) so the results can be plotted under Python 3, where map is lazy
res_mine = list(map(mine, sizes))
res_bincount = list(map(bincount, sizes))
res_original = list(map(original, sizes[:-1]))

Timing code

def bincount(size):
    return min(timeit.repeat(
        """lengths = np.array([len(x) for x in ind])
positions = np.arange(len(ind))
values = df.a.values
avg = np.bincount(positions.repeat(lengths), values[np.concatenate(ind)]) / lengths
df.assign(avg=avg)""",
        """import pandas as pd
import numpy as np
size = {size}
df = pd.DataFrame(columns=['a'])
df['a'] = np.arange(size)
np.random.seed(1)
ind = np.array([np.random.randint(0, size, size=5) for _ in range(size)])
def group(row):
    return df.loc[ind[df.index.get_loc(row.name)], 'a'].mean()""".format(size=size),
    number=100, repeat=10))

def original(size):
    return min(timeit.repeat(
        """df['avg'] = df.apply(group, axis=1)""",
        """import pandas as pd
import numpy as np    
size = {size}             
df = pd.DataFrame(columns=['a'])
df['a'] = np.arange(size)       
np.random.seed(1)               
ind = np.array([np.random.randint(0, size, size=5) for _ in range(size)])
def group(row):                                                          
    return df.loc[ind[df.index.get_loc(row.name)], 'a'].mean()""".format(size=size),
    repeat=10, number=1))

def mine(size):
    return min(timeit.repeat("""df['comparison'] = df.a.values[ind].mean(axis=1)""",
        """import pandas as pd
import numpy as np    
size = {size}             
df = pd.DataFrame(columns=['a'])
df['a'] = np.arange(size)       
np.random.seed(1)               
ind = np.array([np.random.randint(0, size, size=5) for _ in range(size)])
def group(row):                                                          
    return df.loc[ind[df.index.get_loc(row.name)], 'a'].mean()""".format(size=size),
        repeat=100, number=10))

import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes()
ax.plot(sizes, res_mine, label='mine')
ax.plot(sizes, res_bincount, label='bincount')
ax.plot(sizes[:-1], res_original, label='original')
plt.yscale('log')
plt.xscale('log')
plt.legend()
plt.xlabel('size of dataframe')
plt.ylabel('run time (s)')
plt.show()

Note that I had to reduce the number of runs for the original, because it takes a long time.
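
Since the question mentions groupby, the same flattening idea can also be expressed with pandas itself. This is an alternative sketch rather than one of the timed variants: it gathers the values for all flattened indices into a Series labelled by the owning row, then takes a group-wise mean. The ragged `ind` is made up for illustration, and the positional lookup again assumes a default integer index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(10)})

# Illustrative ragged index lists: one per row of df, lengths vary.
ind = [[5, 8, 9], [0, 1], [2, 4, 5, 2], [7], [1, 7, 0, 6, 9],
       [9, 7], [0, 1, 8], [9, 8, 7, 3], [5, 1], [8, 1, 4, 0, 3]]

lengths = [len(x) for x in ind]
flat = np.concatenate(ind)
# Label of the owning row, repeated once per index in its list.
owner = np.repeat(df.index.values, lengths)

# Gather the values, label them by owner, and average per owner.
gathered = pd.Series(df['a'].values[flat], index=owner)
df['avg'] = gathered.groupby(level=0).mean()
```

The assignment aligns on the index, so a row whose index list is empty would simply end up with NaN in `avg`.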