Get the standard deviation of a specific column in grouped data

Asked: 2019-08-28 23:15:21

Tags: python pandas statistics

I have a dataset. I want to compute the standard deviation of a specific column and then add the result back to the original data.

import pandas as pd

raw_data = {'patient': [242, 151, 111, 122, 342],
        'obs': [1, 2, 3, 1, 2],
        'treatment': [0, 1, 0, 1, 0],
        'score': ['strong', 'weak', 'weak', 'weak', 'strong']}

df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score'])

df

   patient  obs  treatment   score
0      242    1          0  strong
1      151    2          1    weak
2      111    3          0    weak
3      122    1          1    weak
4      342    2          0  strong

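So I want the std dev of the patient column, grouped by the score column.

The approach I have in mind is to scan the columns, find the patient column, check that it is numeric (I hope to add that check later), compute the standard deviation, and finally add the result to the original df.

I tried this: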

std_dev_patient = []

for col in df.keys():

    df=df.groupby("score")

    if df[col]=='patient':
           np.std(col).append(std_dev_patient)
    else:
        pass

    df.concat([df,std_dev_patient], axis =1)

    df
  

TypeError: 'str' object is not callable

Is there an efficient way to do this?

Thx

Expected output

   patient  obs  treatment   score  std_dev_patient std_dev_obs
0      242    1          0  strong    70.71            ..
1      151    2          1    weak    20.66            ..  
2      111    3          0    weak    20.66            ..
3      122    1          1    weak    20.66            .. 
4      342    2          0  strong    70.71            ..  

2 answers:

Answer 0 (score: 3)

Use pandas.DataFrame.groupby with transform:

import numpy as np  # needed for np.number below

df['std_dev_patient'] = df.groupby('score')['patient'].transform('std')
print(df)
print(df.select_dtypes(np.number).dtypes)

Output:

   patient  obs  treatment   score  std_dev_patient
0      242    1          0  strong        70.710678
1      151    2          1    weak        20.663978
2      111    3          0    weak        20.663978
3      122    1          1    weak        20.663978
4      342    2          0  strong        70.710678
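As a side note on why transform fits here: a plain groupby std returns one value per group, while transform returns one value per original row, so it can be assigned straight back to df. The sketch below is only an illustration on the question's data. It also shows that pandas uses the sample standard deviation (ddof=1) by default, whereas np.std defaults to the population version (ddof=0). The variable names are mine, not from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'patient': [242, 151, 111, 122, 342],
    'score': ['strong', 'weak', 'weak', 'weak', 'strong'],
})

grouped = df.groupby('score')['patient']

# Aggregation: one value per group, indexed by the group key
per_group = grouped.std()

# transform: one value per original row, indexed like df
per_row = grouped.transform('std')

# Population standard deviation (ddof=0), matching np.std's default
per_row_population = grouped.transform(lambda s: s.std(ddof=0))

print(per_group)
print(per_row)
print(per_row_population)
```

The values in the expected output (70.71 and 20.66) correspond to the ddof=1 variant, so the default 'std' is what the question asks for.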

For the dtype check, use pandas.DataFrame.select_dtypes together with numpy.number:

import numpy as np

g = df.groupby('score')
for c in df.select_dtypes(np.number).columns:
    df['std_dev_%s' % c] = g[c].transform('std')

Output:

   patient  obs  treatment   score  std_dev_patient  std_dev_obs  \
0      242    1          0  strong        70.710678     0.707107   
1      151    2          1    weak        20.663978     1.000000   
2      111    3          0    weak        20.663978     1.000000   
3      122    1          1    weak        20.663978     1.000000   
4      342    2          0  strong        70.710678     0.707107   

   std_dev_treatment  
0            0.00000  
1            0.57735  
2            0.57735  
3            0.57735  
4            0.00000  
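If the loop is not needed, the same result can probably be had in one pass, since transform also accepts several columns at once. This is a minimal sketch, not a definitive implementation: add_group_std is a hypothetical helper name, and it assumes the grouping key should be excluded from the numeric columns.

```python
import pandas as pd

df = pd.DataFrame({
    'patient': [242, 151, 111, 122, 342],
    'obs': [1, 2, 3, 1, 2],
    'treatment': [0, 1, 0, 1, 0],
    'score': ['strong', 'weak', 'weak', 'weak', 'strong'],
})

def add_group_std(frame, by, prefix='std_dev_'):
    """Hypothetical helper: return a copy of frame with per-group std columns."""
    # Keep numeric columns only; skip the grouping key in case it is numeric too
    num_cols = [c for c in frame.select_dtypes(include='number').columns if c != by]
    # One transform call covers all selected columns; the result is row-aligned
    stds = frame.groupby(by)[num_cols].transform('std').add_prefix(prefix)
    return frame.join(stds)

result = add_group_std(df, 'score')
print(result)
```

Because the helper returns a new frame, calling it twice on the original df does not pick up the previously added std_dev_ columns.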

Answer 1 (score: 2)

Is this what you are after?

df['std_dev_patient'] = df.score.map(df.groupby(by='score').patient.std())
df

   patient  obs  treatment   score  std_dev_patient
0      242    1          0  strong        70.710678
1      151    2          1    weak        20.663978
2      111    3          0    weak        20.663978
3      122    1          1    weak        20.663978
4      342    2          0  strong        70.710678
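To make the one-liner easier to follow, here it is split into its two steps. This is just a sketch on the question's data, and lookup is a name I introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'patient': [242, 151, 111, 122, 342],
    'score': ['strong', 'weak', 'weak', 'weak', 'strong'],
})

# Step 1: one standard deviation per group, as a Series indexed by score
lookup = df.groupby('score')['patient'].std()

# Step 2: for each row, look up its score value in that Series
df['std_dev_patient'] = df['score'].map(lookup)

print(lookup)
print(df)
```

map treats the Series like a dictionary from group key to value, which is why the result lines up with the original rows.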

To compute the std for several columns in a for loop, just put the column names you want into the std_cols list.

std_cols = ['patient', 'obs']

for col in std_cols:
    df[f'std_dev_{col}'] = df.score.map(df.groupby(by='score')[col].std())


   patient  obs  treatment   score  std_dev_patient  std_dev_obs
0      242    1          0  strong        70.710678     0.707107
1      151    2          1    weak        20.663978     1.000000
2      111    3          0    weak        20.663978     1.000000
3      122    1          1    weak        20.663978     1.000000
4      342    2          0  strong        70.710678     0.707107

To make the OP's original for loop solution work, do the following:

std_dev_patient = []
df_g = df.groupby("score")  # group once, outside the loop
for col in df.keys():
    if col == 'patient':
        # per-group std as a Series indexed by score
        std_dev_patient.append(df_g[col].std())
# map each row's score to its group's std
df['std_dev_patient'] = df.score.map(std_dev_patient[0])

   patient  obs  treatment   score  std_dev_patient
0      242    1          0  strong        70.710678
1      151    2          1    weak        20.663978
2      111    3          0    weak        20.663978
3      122    1          1    weak        20.663978
4      342    2          0  strong        70.710678
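For anyone choosing between the approaches in this thread, a quick sanity check suggests that transform and map give the same values on this data. This is a sketch only. check_names=False is there because the two Series carry different names (patient versus score).

```python
import pandas as pd
from pandas.testing import assert_series_equal

df = pd.DataFrame({
    'patient': [242, 151, 111, 122, 342],
    'score': ['strong', 'weak', 'weak', 'weak', 'strong'],
})

# Approach 1: broadcast the group std back to each row
via_transform = df.groupby('score')['patient'].transform('std')

# Approach 2: compute the group std, then look it up per row
via_map = df['score'].map(df.groupby('score')['patient'].std())

# Raises AssertionError if the two results differ
assert_series_equal(via_transform, via_map, check_names=False)
print(via_transform)
```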