I have a pandas DataFrame containing some values:
id pair value subdir
taylor_1e3c_1s_56C taylor 6_13 -0.398716 run1
taylor_1e3c_1s_56C taylor 6_13 -0.397820 run2
taylor_1e3c_1s_56C taylor 6_13 -0.397310 run3
taylor_1e3c_1s_56C taylor 6_13 -0.390520 run4
taylor_1e3c_1s_56C taylor 6_13 -0.377390 run5
taylor_1e3c_1s_56C taylor 8_11 -0.393604 run1
taylor_1e3c_1s_56C taylor 8_11 -0.392899 run2
taylor_1e3c_1s_56C taylor 8_11 -0.392473 run3
taylor_1e3c_1s_56C taylor 8_11 -0.389959 run4
taylor_1e3c_1s_56C taylor 8_11 -0.387946 run5
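What I want to do is isolate the rows that share the same index, id and pair, compute the mean and standard deviation of the value column, and put it all in a new DataFrame. Because I have then effectively averaged over all possible values of subdir, that column should be dropped as well. So the output should look like this: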
id pair value error
taylor_1e3c_1s_56C taylor 6_13 -0.392351 0.013213
taylor_1e3c_1s_56C taylor 8_11 -0.391376 0.016432
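How should I do this in pandas?
A previous question showed me how to obtain the mean, but it is not clear to me how to generalize this to also get the error on the mean (i.e. the standard deviation).
Many thanks, everyone :)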
Answer 0 (score: 3)
You can promote the index to a column and perform a single groupby:
import pandas as pd
df = pd.DataFrame([['taylor', '6_13', -0.398716, 'run1'],
['taylor', '6_13', -0.397820, 'run2'],
['taylor', '8_11', -0.389959, 'run4'],
['taylor', '8_11', -0.387946, 'run5']],
index=['taylor_1e3c_1s_56C', 'taylor_1e3c_1s_56C', 'taylor_1e3c_1s_56C', 'taylor_1e3c_1s_56C'],
columns=['id', 'pair', 'value', 'subdir'])
Promote the index to a column:
df['index'] = df.index
Perform the groupby operation:
new_df = df.groupby(['index', 'id', 'pair']).agg({'value': ['mean', 'std']})
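The result above has two-level column labels, ('value', 'mean') and ('value', 'std'). As a possible variation, named aggregation (available in pandas 0.25 and later) produces the flat `value` and `error` columns from the question directly. This is a minimal sketch, not part of the original answer; it names the index instead of copying it into a column, and `flat_df` is a name chosen here for illustration.

```python
import pandas as pd

# Same four illustrative rows as in the answer above.
df = pd.DataFrame(
    [['taylor', '6_13', -0.398716, 'run1'],
     ['taylor', '6_13', -0.397820, 'run2'],
     ['taylor', '8_11', -0.389959, 'run4'],
     ['taylor', '8_11', -0.387946, 'run5']],
    index=['taylor_1e3c_1s_56C'] * 4,
    columns=['id', 'pair', 'value', 'subdir'])

# Naming the index lets groupby refer to it by name like a column.
df.index.name = 'index'

# Named aggregation: output column = (input column, aggregation function).
flat_df = (df.groupby(['index', 'id', 'pair'])
             .agg(value=('value', 'mean'), error=('value', 'std'))
             .reset_index(['id', 'pair']))  # keep only 'index' as the row index

print(flat_df)
```

The subdir column disappears on its own because it is neither a grouping key nor an aggregated column.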
Answer 1 (score: 1)
Get the mean:
mean_df = df['value'].groupby(df['pair']).mean()
Get the standard deviation:
std_df = df['value'].groupby(df['pair']).std()
# The result is a Series, so rename the Series itself to `error`
std_df = std_df.rename('error')
Combine the two required Series into a new DataFrame:
new_df = pd.concat([mean_df,std_df],axis=1)
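The steps above group by pair only. A minimal sketch of the same two-step approach, grouping by the index, id and pair together as the question asks, could look like the following. The names `keys`, `grouped`, `mean_s` and `std_s` are chosen here for illustration, and the four sample rows are a reduced subset of the question's data.

```python
import pandas as pd

df = pd.DataFrame(
    [['taylor', '6_13', -0.398716, 'run1'],
     ['taylor', '6_13', -0.397820, 'run2'],
     ['taylor', '8_11', -0.389959, 'run4'],
     ['taylor', '8_11', -0.387946, 'run5']],
    index=['taylor_1e3c_1s_56C'] * 4,
    columns=['id', 'pair', 'value', 'subdir'])

# Group the value column by the row index plus the id and pair columns.
keys = [df.index, df['id'], df['pair']]
grouped = df.groupby(keys)['value']

# Each aggregation returns a Series; rename each one to its target column name.
mean_s = grouped.mean().rename('value')
std_s = grouped.std().rename('error')

# Series with identical indexes line up side by side as columns.
new_df = pd.concat([mean_s, std_s], axis=1)
print(new_df)
```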
Hope this helps. An improved answer will follow soon.
Answer 2 (score: 1)
Here is my solution, adapted from an answer to the previous question and from this post on how to flatten a hierarchical index in columns.
# create dataframe
import pandas as pd
from io import StringIO
text = """ id pair value subdir
taylor_1e3c_1s_56C taylor 6_13 -0.398716 run1
taylor_1e3c_1s_56C taylor 6_13 -0.397820 run2
taylor_1e3c_1s_56C taylor 6_13 -0.397310 run3
taylor_1e3c_1s_56C taylor 6_13 -0.390520 run4
taylor_1e3c_1s_56C taylor 6_13 -0.377390 run5
taylor_1e3c_1s_56C taylor 8_11 -0.393604 run1
taylor_1e3c_1s_56C taylor 8_11 -0.392899 run2
taylor_1e3c_1s_56C taylor 8_11 -0.392473 run3
taylor_1e3c_1s_56C taylor 8_11 -0.389959 run4
taylor_1e3c_1s_56C taylor 8_11 -0.387946 run5"""
# The header has one field fewer than the data rows, so the first column becomes the index
df = pd.read_csv(StringIO(text), sep=r"\s+")
First aggregate using groupby() and agg(), then call reset_index():
df1 = df.groupby([df.index, df['id'], df['pair']]).agg({'value': ['mean', 'std']}).reset_index(level=[1,2])
Now rename the columns:
df1.columns = ['id', 'pair', 'value', 'error']
Output:
# id pair value error
#taylor_1e3c_1s_56C taylor 6_13 -0.392351 0.008975
#taylor_1e3c_1s_56C taylor 8_11 -0.391376 0.002359
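The error column above is the sample standard deviation, since pandas' std() uses ddof=1 by default. If a different definition of the error is wanted, the sketch below shows two alternatives: the population standard deviation (ddof=0) and the standard error of the mean via sem(). It is only an illustration on the five 6_13 values from the question; which definition the question intends is an assumption left open here, and the variable names are chosen for this example.

```python
import pandas as pd

# The five 6_13 values from the question, with a matching grouping key.
values = pd.Series([-0.398716, -0.397820, -0.397310, -0.390520, -0.377390],
                   name='value')
pair = pd.Series(['6_13'] * 5, name='pair')

grouped = values.groupby(pair)

sample_std = grouped.std()            # divides by N - 1 (pandas default, ddof=1)
population_std = grouped.std(ddof=0)  # divides by N
std_error = grouped.sem()             # sample std divided by sqrt(N)

print(pd.concat([sample_std.rename('std'),
                 population_std.rename('std_ddof0'),
                 std_error.rename('sem')], axis=1))
```

Any of these can be passed to agg() in place of 'std', for example 'sem' as a string or a lambda calling std(ddof=0).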