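I have a DataFrame like the one below. Using the "part1" column as the basis, I want to split the data into 3 groups (each containing the same number of rows) and compute the mean of part2 for each group. For example, row 0 and row 1 form group B, whose mean is (0.67 + (-0.03)) / 2.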
import pandas as pd
df = pd.DataFrame({
    "date": ["20130101", "20130101", "20130103", "20130103", "20130105", "20130105"],
    "part1": [0.5, 0.7, 1.3, 1.5, 0.1, 0.3],
    "part2": [0.67, -0.03, 1.95, -3.25, -0.3, 0.6]
})
       date  part1  part2  output
0  20130101    0.5   0.67    0.32
1  20130101    0.7  -0.03    0.32
2  20130103    1.3   1.95   -0.65
3  20130103    1.5  -3.25   -0.65
4  20130105    0.1  -0.3     0.15
5  20130105    0.3   0.6     0.15
Answer 0 (score: 0)
If your data is stored in df, you can do this with pandas:
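import numpy as np

def foo(x, n=3):
    df = x.copy()
    # n + 1 quantile edges split part1 into n bins
    bins = np.quantile(df['part1'], np.linspace(0, 1, n + 1))
    # Default to bin 0 so the minimum value, which the strict ">" below excludes, is still assigned
    df['tmp'] = 0
    for i in range(n):
        idx = (df['part1'] > bins[i]) & (df['part1'] <= bins[i + 1])
        # .loc avoids chained assignment, which may silently fail to modify df
        df.loc[idx, 'tmp'] = i
    return df.groupby('tmp').agg({'part2': 'mean'})

foo(df)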
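The data is split at the quantiles of part1, so each group has the same number of elements, provided the number of rows is divisible by n and no values tie at a bin edge. Grouping by tmp then yields these groups, and the mean of part2 is computed for each one: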
     part2
tmp
0     0.15
1     0.32
2    -0.65
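The quantile loop above can probably be expressed more compactly with pd.qcut, and the group means can be written back onto every row to reproduce the "output" column from the question. This is only a sketch under the assumption that part1 has no ties at the bin edges; the column names tmp and output are taken from the question and the answer, nothing else is implied about the asker's real data.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["20130101", "20130101", "20130103", "20130103", "20130105", "20130105"],
    "part1": [0.5, 0.7, 1.3, 1.5, 0.1, 0.3],
    "part2": [0.67, -0.03, 1.95, -3.25, -0.3, 0.6],
})

# labels=False returns integer bin codes (0, 1, 2) instead of interval labels
df["tmp"] = pd.qcut(df["part1"], q=3, labels=False)

# transform("mean") returns one value per original row, aligned on the index
df["output"] = df.groupby("tmp")["part2"].transform("mean")

print(df)
```

Using transform instead of agg keeps the original row order, which is what the desired table in the question shows.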
Answer 1 (score: 0)
If you want the mean for each day, you can use groupby, as follows:
import pandas as pd
df = pd.DataFrame({
    "date": ["20130101", "20130101", "20130103", "20130103", "20130105", "20130105"],
    "part1": [0.5, 0.7, 1.3, 1.5, 0.1, 0.3],
    "part2": [0.67, -0.03, 1.95, -3.25, -0.3, 0.6]
})
df.groupby("date").mean().reset_index()
Result:
       date  part1  part2
0  20130101    0.6   0.32
1  20130103    1.4  -0.65
2  20130105    0.2   0.15
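If the per-day mean should appear next to every row, as in the "output" column of the question, a transform-based variant might look like the sketch below. Note that this groups by date, not by part1; it matches the desired output only because, in this sample, the dates happen to line up with the part1 ranges.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["20130101", "20130101", "20130103", "20130103", "20130105", "20130105"],
    "part1": [0.5, 0.7, 1.3, 1.5, 0.1, 0.3],
    "part2": [0.67, -0.03, 1.95, -3.25, -0.3, 0.6],
})

# Broadcast each day's mean of part2 back to the rows of that day
df["output"] = df.groupby("date")["part2"].transform("mean")

print(df)
```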
Answer 2 (score: 0)
You can pass a function to the by parameter of pandas' groupby method.
from functools import partial
import pandas as pd

df = pd.DataFrame({
    "date": ["20130101", "20130101", "20130103", "20130103", "20130105", "20130105"],
    "part1": [0.5, 0.7, 1.3, 1.5, 0.1, 0.3],
    "part2": [0.67, -0.03, 1.95, -3.25, -0.3, 0.6]
})

def grouper(df, val):
    # groupby calls this once per index label; look up that row's part1 value
    foo = df.loc[val, 'part1']
    # Thresholds are hard-coded for this sample data; the ranges no longer overlap
    if 0.0 < foo < 0.4:
        return 0
    elif 0.4 <= foo < 1.0:
        return 1
    elif foo >= 1.0:
        return 2

grouped = df['part2'].groupby(by=partial(grouper, df)).mean()
This results in:
0    0.15
1    0.32
2   -0.65
Name: part2, dtype: float64
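The thresholds in the grouper function are specific to this sample. One way to avoid hard-coding them, sketched here as an assumption rather than as part of the answer, is to derive the group from each row's rank in part1. The names n_groups and group are illustrative only. Because ranks are unique with method="first", the groups are equal in size even when part1 contains ties, as long as the row count is divisible by n_groups.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["20130101", "20130101", "20130103", "20130103", "20130105", "20130105"],
    "part1": [0.5, 0.7, 1.3, 1.5, 0.1, 0.3],
    "part2": [0.67, -0.03, 1.95, -3.25, -0.3, 0.6],
})

n_groups = 3  # assumed number of groups, taken from the question

# Ranks run from 1 to len(df); method="first" breaks ties by row order
rank = df["part1"].rank(method="first")

# Map ranks 1..len(df) onto group codes 0..n_groups-1
group = ((rank - 1) * n_groups // len(df)).astype(int).rename("group")

# A Series aligned with df's index can be passed to groupby directly
grouped = df["part2"].groupby(group).mean()

print(grouped)
```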