Question

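I am trying to create a new column that gives me the number of rows in each group, i.e. how often a particular combination of key values occurs. I am doing something like the following...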

import pandas as pd 

table = '''A B C
1 1 1
1 1 2
1 1 4
2 1 3
2 1 5'''

df = pd.DataFrame([t.split(' ') for t in table.split('\n')[1:]], 
        columns=table.split('\n')[0].split(' '))

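def appendCnt(df, factors):
    # Name the new column after the grouping keys, e.g. 'counts-A-B'
    f = 'counts-' + '-'.join(factors)
    df[f] = 0
    for k, v in df.groupby(factors):
        # .loc replaces the removed .ix indexer and avoids chained assignment
        df.loc[v.index, f] = len(v)
    return df

factors = ['A', 'B']
print(appendCnt(df, factors))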

When I run this code, it is unacceptably slow:

In [7]: run test
   A  B  C  counts-A-B
0  1  1  1           3
1  1  1  2           3
2  1  1  4           3
3  2  1  3           2
4  2  1  5           2

In [8]: %timeit for _ in xrange(5): appendCnt1(df, factors)
1 loops, best of 3: 225 ms per loop

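Most of the time seems to be spent writing the new data into the table. Is there a faster way to achieve this? I feel there must be a quicker way, since this is really a basic operation...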

Answer 1

If I understand what you want, you can use transform:

df['counts-'+ '-'.join(factors)] = df.groupby(factors).transform("count")

df
Out[6]: 
   A  B  C  counts-A-B
0  1  1  1           3
1  1  1  2           3
2  1  1  4           3
3  2  1  3           2
4  2  1  5           2
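
The one-liner above relies on the frame having exactly one non-key column (C), so that transform hands back a single column. A slightly more defensive variant, offered as a sketch rather than the only way to do it, selects one column before transforming so the result is always a Series. It also uses "size" instead of "count": "size" counts every row in the group, whereas "count" skips NaN values. The sample frame below is rebuilt with integer columns purely for illustration.

```python
import pandas as pd

# Same sample data as in the question, built directly as integers
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2],
    "B": [1, 1, 1, 1, 1],
    "C": [1, 2, 4, 3, 5],
})
factors = ["A", "B"]
col = "counts-" + "-".join(factors)

# Selecting a single column makes transform return a Series, so the
# assignment works no matter how many other columns the frame has.
# "size" is the number of rows per group, NaN values included.
df[col] = df.groupby(factors)[factors[0]].transform("size")
print(df)
```

Because the grouping and the broadcast back to the original rows both happen inside pandas, there is no Python-level loop over groups and no repeated write into the frame.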
