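How to create a Python DataFrame containing the mean and standard deviation of certain rows of another DataFrame

Time: 2018-02-18 17:31:06

Tags: python pandas dataframe

I have a pandas DataFrame containing some values: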

                    id  pair      value  subdir
taylor_1e3c_1s_56C  taylor  6_13  -0.398716    run1 
taylor_1e3c_1s_56C  taylor  6_13  -0.397820    run2 
taylor_1e3c_1s_56C  taylor  6_13  -0.397310    run3 
taylor_1e3c_1s_56C  taylor  6_13  -0.390520    run4 
taylor_1e3c_1s_56C  taylor  6_13  -0.377390    run5 
taylor_1e3c_1s_56C  taylor  8_11  -0.393604    run1
taylor_1e3c_1s_56C  taylor  8_11  -0.392899    run2
taylor_1e3c_1s_56C  taylor  8_11  -0.392473    run3
taylor_1e3c_1s_56C  taylor  8_11  -0.389959    run4
taylor_1e3c_1s_56C  taylor  8_11  -0.387946    run5

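What I want to do is isolate the rows that share the same index, id and pair, compute the mean and standard deviation of the value column, and put the results in a new DataFrame. Since this effectively averages over all possible values of subdir, that column should be dropped as well. The output should look something like this: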

                    id  pair      value     error
taylor_1e3c_1s_56C  taylor  6_13  -0.392351 0.013213
taylor_1e3c_1s_56C  taylor  8_11  -0.391376 0.016432

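How should I do this in pandas?

A previous question showed me how to get the mean, but it is not clear to me how to generalize this to also get the error on the mean (that is, the standard deviation).

Many thanks, everyone :)

3 answers:

Answer 0 (score: 3)

You can promote the index to a column and perform a single groupby: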

import pandas as pd

df = pd.DataFrame([['taylor', '6_13', -0.398716, 'run1'],
                   ['taylor', '6_13', -0.397820, 'run2'], 
                   ['taylor', '8_11', -0.389959, 'run4'],
                   ['taylor', '8_11', -0.387946, 'run5']],
                  index=['taylor_1e3c_1s_56C', 'taylor_1e3c_1s_56C', 'taylor_1e3c_1s_56C', 'taylor_1e3c_1s_56C'],
                  columns=['id', 'pair', 'value', 'subdir'])

Promote the index to a column:

df['index'] = df.index

Perform the groupby operation:

new_df = df.groupby(['index', 'id', 'pair']).agg({'value': ['mean', 'std']})

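The aggregation above returns a three-level row index and two-level columns (`value`/`mean`, `value`/`std`), which is not yet the flat layout the question asks for. A minimal sketch of one way to get `value` and `error` columns directly, assuming pandas 0.25 or later (for named aggregation). The index label `name` is a hypothetical placeholder introduced only so the unnamed index can be used as a grouping key:

```python
import pandas as pd

# Rebuild the question's data: five runs for each of the two pairs.
values = {
    "6_13": [-0.398716, -0.397820, -0.397310, -0.390520, -0.377390],
    "8_11": [-0.393604, -0.392899, -0.392473, -0.389959, -0.387946],
}
rows = [
    ("taylor", pair, value, f"run{i}")
    for pair, vals in values.items()
    for i, value in enumerate(vals, start=1)
]
df = pd.DataFrame(
    rows,
    columns=["id", "pair", "value", "subdir"],
    index=["taylor_1e3c_1s_56C"] * len(rows),
)

result = (
    df.rename_axis("name")                  # hypothetical label for the unnamed index
      .groupby(["name", "id", "pair"])["value"]
      .agg(value="mean", error="std")       # named aggregation gives flat column names
      .reset_index(level=["id", "pair"])    # move id and pair back into columns
      .rename_axis(None)                    # drop the temporary index label again
)
print(result)
```

Selecting only the `value` column before aggregating also means `subdir` never appears in the result, so there is nothing to drop afterwards.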

Answer 1 (score: 1)

Get the mean:

mean_df = df['value'].groupby(df['pair']).mean()

Get the standard deviation:

std_df = df['value'].groupby(df['pair']).std()

# Rename the Series to `error` (a Series takes a new name, not a columns mapping)

std_df = std_df.rename('error')

Combine the two Series into a new DataFrame:

new_df = pd.concat([mean_df,std_df],axis=1)

Hope it helps. I will post an improved answer soon.
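Note that the snippets above group by `pair` alone, whereas the question asks for rows sharing the same index, id and pair. With the sample data this makes no difference, because there is only one id. The sketch below uses made-up values (the id `other` and all numbers are hypothetical, for illustration only) to show how the two groupings diverge once a second id shares a pair label:

```python
import pandas as pd

# Hypothetical data: two different ids that happen to share the pair "6_13".
df = pd.DataFrame({
    "id": ["taylor", "taylor", "other", "other"],
    "pair": ["6_13", "6_13", "6_13", "6_13"],
    "value": [-0.40, -0.38, 0.10, 0.12],
})

# Grouping by pair only pools both ids into a single row.
by_pair = df.groupby("pair")["value"].agg(["mean", "std"])

# Grouping by id and pair keeps one row per (id, pair) combination.
by_id_pair = df.groupby(["id", "pair"])["value"].agg(["mean", "std"])

print(by_pair)
print(by_id_pair)
```

If the real data can contain several ids (or several index labels) per pair, adding them to the grouping keys avoids averaging unrelated rows together.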

Answer 2 (score: 1)

Here is my solution, adapted from an answer to the previous question and from this post on how to flatten a hierarchical index in columns.

# create dataframe
import pandas as pd
from io import StringIO  # Python 3 location of StringIO

text = """                    id  pair      value  subdir
taylor_1e3c_1s_56C  taylor  6_13  -0.398716    run1 
taylor_1e3c_1s_56C  taylor  6_13  -0.397820    run2 
taylor_1e3c_1s_56C  taylor  6_13  -0.397310    run3 
taylor_1e3c_1s_56C  taylor  6_13  -0.390520    run4 
taylor_1e3c_1s_56C  taylor  6_13  -0.377390    run5 
taylor_1e3c_1s_56C  taylor  8_11  -0.393604    run1
taylor_1e3c_1s_56C  taylor  8_11  -0.392899    run2
taylor_1e3c_1s_56C  taylor  8_11  -0.392473    run3
taylor_1e3c_1s_56C  taylor  8_11  -0.389959    run4
taylor_1e3c_1s_56C  taylor  8_11  -0.387946    run5"""
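# DataFrame.from_csv was removed in pandas 1.0; read_csv uses the first column
# as the index because the header has one fewer field than the data rows
df = pd.read_csv(StringIO(text), sep=r"\s+")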

First aggregate with groupby() and agg(), then call reset_index():

df1 = df.groupby([df.index, df['id'], df['pair']]).agg({'value': ['mean', 'std']}).reset_index(level=[1,2])

Now rename the columns:

df1.columns = ['id', 'pair', 'value', 'error']

Output:

#                        id  pair     value     error
#taylor_1e3c_1s_56C  taylor  6_13 -0.392351  0.008975
#taylor_1e3c_1s_56C  taylor  8_11 -0.391376  0.002359
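The question uses "error on the mean" and "standard deviation" interchangeably. The `error` column above is the sample standard deviation (pandas uses `ddof=1` by default). If the standard error of the mean is what is actually wanted, it is the standard deviation divided by the square root of the number of rows, and pandas exposes it as `sem`. A minimal sketch on the `6_13` rows, assuming that reading of the question:

```python
import numpy as np
import pandas as pd

# The five 6_13 runs from the question.
df = pd.DataFrame({
    "pair": ["6_13"] * 5,
    "value": [-0.398716, -0.397820, -0.397310, -0.390520, -0.377390],
})

# 'std' is the sample standard deviation (ddof=1);
# 'sem' is the standard error of the mean, i.e. std / sqrt(count).
stats = df.groupby("pair")["value"].agg(["mean", "std", "sem", "count"])
print(stats)

# Check the relationship between the two quantities.
row = stats.loc["6_13"]
assert np.isclose(row["sem"], row["std"] / np.sqrt(row["count"]))
```

Swapping `'std'` for `'sem'` in any of the aggregations above would report the standard error instead.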