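The DataFrame is generated as follows. What is the best way to count how many values are greater than their group mean, where rows are grouped on the key columns (for example the ["A", "B"] group)?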
import numpy as np
import pandas as pd

keys = np.array([['A', 'B'], ['A', 'B'], ['A', 'B'],
                 ['A', 'B'], ['C', 'D'], ['C', 'D'],
                 ['C', 'D'], ['E', 'F'], ['E', 'F'],
                 ['G', 'H']])

df = (pd.DataFrame(np.hstack([keys, np.random.randn(10, 4).round(2)]),
                   columns=['col1', 'col2', 'col3',
                            'col4', 'col5', 'col6'])
        .astype({'col3': float,
                 'col4': float,
                 'col5': float,
                 'col6': float}))
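I tried computing the means first, then joining the grouped DataFrame back onto the original DataFrame, and then calling sum(1). But that seems a bit tedious.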
df2 = pd.merge(df.groupby(["col1", "col2"]).mean(), df, left_on=["col1", "col2"], right_on=["col1", "col2"])
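For reference, the merge-based idea could be carried through to the end roughly as follows. This is only a sketch on a small fixed frame (the data, the `_mean` suffix, and the names `value_cols`, `merged`, `above`, `per_row`, and `per_column` are assumptions for illustration, not part of the original code).

```python
import pandas as pd

# Small fixed frame (hypothetical data, chosen so the result is easy to check by hand)
df = pd.DataFrame({
    "col1": ["A", "A", "A", "C", "C"],
    "col2": ["B", "B", "B", "D", "D"],
    "col3": [1.0, 2.0, 6.0, 4.0, 8.0],
    "col4": [5.0, 5.0, 5.0, 1.0, 3.0],
})
keys = ["col1", "col2"]
value_cols = ["col3", "col4"]

# Group means as ordinary columns, then attach them to every row
means = df.groupby(keys)[value_cols].mean().reset_index()
merged = df.merge(means, on=keys, suffixes=("", "_mean"))

# True where a value exceeds the mean of its own group
above = pd.DataFrame({c: merged[c] > merged[c + "_mean"] for c in value_cols})

per_row = above.sum(axis=1)   # count per row, i.e. the sum(1) step
per_column = above.sum()      # count per column over the whole frame
print(per_column)
```

With this data, only 6.0 exceeds the col3 mean of group (A, B), and 8.0 and 3.0 exceed the col3 and col4 means of group (C, D), so `per_column` is 2 for col3 and 1 for col4.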
Answer 0 (score: 2)
You can do it like this (the exact counts depend on the randomly generated data):
(df[['col3', 'col4', 'col5', 'col6']]>df.groupby(['col1','col2']).transform('mean')).sum()
col3 4
col4 6
col5 3
col6 6
dtype: int64
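The expression above gives one total per column over the whole frame. If the counts are wanted per group instead, the same boolean mask can be grouped again. The following is a minimal sketch on fixed data; the frame contents and the names `mask`, `totals`, and `per_group` are assumptions for illustration.

```python
import pandas as pd

# Small fixed frame (hypothetical data)
df = pd.DataFrame({
    "col1": ["A", "A", "A", "C", "C"],
    "col2": ["B", "B", "B", "D", "D"],
    "col3": [1.0, 2.0, 6.0, 4.0, 8.0],
    "col4": [5.0, 5.0, 5.0, 1.0, 3.0],
})
keys = ["col1", "col2"]
value_cols = ["col3", "col4"]

# transform('mean') returns the group mean aligned to each original row
group_means = df.groupby(keys)[value_cols].transform("mean")
mask = df[value_cols] > group_means

totals = mask.sum()                                      # one count per column
per_group = mask.groupby([df["col1"], df["col2"]]).sum()  # one count per group and column
print(per_group)
```

Because `transform` keeps the original row index, the comparison lines up row by row without any merge.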
Answer 1 (score: 2)
You will need the agg method:
In [28]: df.groupby(['col1', 'col2']).agg(lambda x: (x > x.mean()).sum())
Out[28]:
           col3  col4  col5  col6
col1 col2
A    B      1.0   2.0   2.0   2.0
C    D      2.0   2.0   2.0   2.0
E    F      1.0   1.0   1.0   1.0
G    H      0.0   0.0   0.0   0.0
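Essentially, x is array-like: it holds the values of one column within one group. x > x.mean() gives True for each element that is greater than the group mean and False otherwise, and sum then counts the number of True values.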
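The steps inside the lambda can be traced on a single group. This is a small illustration with made-up values; the Series `x` stands in for one column of one group as `agg` would pass it.

```python
import pandas as pd

# Hypothetical values of one column within one group
x = pd.Series([1.0, 2.0, 6.0], name="col3")

mean = x.mean()          # 3.0
mask = x > mean          # element-wise comparison against the group mean
count = int(mask.sum())  # True counts as 1, False as 0

print(mask.tolist(), count)
```

Here only 6.0 exceeds the mean of 3.0, so the count for this group and column is 1.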
Answer 2 (score: 1)
numpy.add.at and pandas.factorize
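cols = ['col1', 'col2']
# i: integer group code for each row; r: the unique (col1, col2) tuples
i, r = pd.factorize([*zip(*map(df.get, cols))])
# value columns as a NumPy array
v = df.drop(columns=cols).values
n, m = shape = len(r), v.shape[1]
b = np.zeros(shape)            # per-group sums
c = np.zeros(shape)            # per-group element counts
d = np.zeros(shape, np.int64)  # per-group counts of values above the mean
# (group, column) target position for every element of v.ravel()
i0, i1 = i.repeat(m), np.tile(np.arange(m), len(v))
np.add.at(b, (i0, i1), v.ravel())
np.add.at(c, (i0, i1), 1)
# b / c is the group mean; [i] broadcasts it back to one row per original row
np.add.at(d, (i0, i1), (v > (b / c)[i]).ravel())
pd.DataFrame(
    d, pd.MultiIndex.from_tuples(r, names=cols),
    df.columns.difference(cols)
)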
           col3  col4  col5  col6
col1 col2
A    B        2     2     3     2
C    D        2     1     1     2
E    F        1     1     1     1
G    H        0     0     0     0
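The core of this approach is easier to see in one dimension. The sketch below uses a single label array and a single value array (both made up for illustration) to show how `pd.factorize` produces group codes and how `np.add.at` accumulates into repeated positions.

```python
import numpy as np
import pandas as pd

# Hypothetical one-dimensional data: one group label and one value per row
labels = np.array(["A", "A", "C", "A", "C"])
values = np.array([1.0, 2.0, 4.0, 6.0, 8.0])

# codes maps each row to an integer group id; uniques lists the groups
codes, uniques = pd.factorize(labels)

sums = np.zeros(len(uniques))
counts = np.zeros(len(uniques))
# np.add.at is unbuffered, so repeated codes accumulate correctly
np.add.at(sums, codes, values)
np.add.at(counts, codes, 1)
means = sums / counts

# Count, per group, the values above that group's mean
above = np.zeros(len(uniques), dtype=np.int64)
np.add.at(above, codes, values > means[codes])

print(dict(zip(uniques, above)))
```

A plain `sums[codes] += values` would not work here, because with repeated indices only one addition per position takes effect; `np.add.at` exists for exactly that case.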