I have a DataFrame, and I want to compute the standard deviation of a particular column and then add the result back to the original data.
import pandas as pd
raw_data = {'patient': [242, 151, 111, 122, 342],
            'obs': [1, 2, 3, 1, 2],
            'treatment': [0, 1, 0, 1, 0],
            'score': ['strong', 'weak', 'weak', 'weak', 'strong']}
df = pd.DataFrame(raw_data, columns=['patient', 'obs', 'treatment', 'score'])
df
   patient  obs  treatment   score
0      242    1          0  strong
1      151    2          1    weak
2      111    3          0    weak
3      122    1          1    weak
4      342    2          0  strong
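So I want the standard deviation of the patient column, grouped by the score column.
The approach I have in mind is to scan the columns, find the patient column, check that it is numeric (I would like to extend this to other columns later), compute the standard deviation, and finally add the result to the original df.
This is what I tried: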
std_dev_patient = []
for col in df.keys():
    df=df.groupby("score")
    if df[col]=='patient':
        np.std(col).append(std_dev_patient)
    else:
        pass
df.concat([df,std_dev_patient], axis =1)
df
TypeError: 'str' object is not callable
Is there an efficient way to do this?
Thx
Expected output:
   patient  obs  treatment   score  std_dev_patient  std_dev_obs
0      242    1          0  strong            70.71           ..
1      151    2          1    weak            20.66           ..
2      111    3          0    weak            20.66           ..
3      122    1          1    weak            20.66           ..
4      342    2          0  strong            70.71           ..
Answer 0 (score: 3)
Use pandas.DataFrame.groupby followed by transform:
import numpy as np
df['std_dev_patient'] = df.groupby('score')['patient'].transform('std')
print(df)
print(df.select_dtypes(np.number).dtypes)  # dtypes of the numeric columns only
Output of print(df):
   patient  obs  treatment   score  std_dev_patient
0      242    1          0  strong        70.710678
1      151    2          1    weak        20.663978
2      111    3          0    weak        20.663978
3      122    1          1    weak        20.663978
4      342    2          0  strong        70.710678
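To see why transform can be assigned straight back to df, it may help to compare it with a plain aggregation. The following is only a small sketch on a two-column copy of the question's data; the names per_group and per_row are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'patient': [242, 151, 111, 122, 342],
    'score': ['strong', 'weak', 'weak', 'weak', 'strong'],
})

# Plain aggregation: one value per group, indexed by the group key.
per_group = df.groupby('score')['patient'].std()

# transform: the same per-group values, repeated so that the result
# has one entry per original row and shares df's index.
per_row = df.groupby('score')['patient'].transform('std')

print(per_group.shape)                 # (2,)
print(per_row.shape)                   # (5,)
print(per_row.index.equals(df.index))  # True
```

Because per_row has the same index as df, the assignment df['std_dev_patient'] = per_row lines up row by row without any merge.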
To check the dtype, use pandas.DataFrame.select_dtypes with numpy.number:
import numpy as np
g = df.groupby('score')
# the numeric column list is evaluated once, before any new columns are added
for c in df.select_dtypes(np.number).columns:
    df['std_dev_%s' % c] = g[c].transform('std')
Output:
   patient  obs  treatment   score  std_dev_patient  std_dev_obs  std_dev_treatment
0      242    1          0  strong        70.710678     0.707107            0.00000
1      151    2          1    weak        20.663978     1.000000            0.57735
2      111    3          0    weak        20.663978     1.000000            0.57735
3      122    1          1    weak        20.663978     1.000000            0.57735
4      342    2          0  strong        70.710678     0.707107            0.00000
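If the explicit loop is not needed, the same result can probably be had in one pass by calling transform on all numeric columns at once. This is a sketch rather than part of the answer above: num_cols, stds and result are illustrative names, and it assumes a fresh df without any std_dev_ columns yet.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'patient': [242, 151, 111, 122, 342],
    'obs': [1, 2, 3, 1, 2],
    'treatment': [0, 1, 0, 1, 0],
    'score': ['strong', 'weak', 'weak', 'weak', 'strong'],
})

# Names of the numeric columns ('score' is a string column and is skipped).
num_cols = df.select_dtypes(include=np.number).columns.tolist()

# Per-group std for every numeric column, broadcast back to the rows,
# with each column renamed to std_dev_<original name>.
stds = df.groupby('score')[num_cols].transform('std').add_prefix('std_dev_')

# Join on the shared row index to get the original data plus the new columns.
result = df.join(stds)
print(result)
```

Building result with join leaves df itself untouched, which avoids picking up already-added std_dev_ columns if the snippet is run a second time.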
Answer 1 (score: 2)
Is this what you are after?
df['std_dev_patient'] = df.score.map(df.groupby(by='score').patient.std())
df
   patient  obs  treatment   score  std_dev_patient
0      242    1          0  strong        70.710678
1      151    2          1    weak        20.663978
2      111    3          0    weak        20.663978
3      122    1          1    weak        20.663978
4      342    2          0  strong        70.710678
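The one-liner above works in two steps, which can be written out separately. A rough sketch follows; group_std and mapped are names chosen here for illustration, and the final comparison with transform is only a sanity check.

```python
import pandas as pd

df = pd.DataFrame({
    'patient': [242, 151, 111, 122, 342],
    'score': ['strong', 'weak', 'weak', 'weak', 'strong'],
})

# Step 1: one std per group, as a Series indexed by the score labels.
group_std = df.groupby(by='score').patient.std()

# Step 2: map looks up each row's score label in that Series.
mapped = df.score.map(group_std)

# Sanity check: this should give the same values as groupby(...).transform('std').
same = mapped.equals(df.groupby('score')['patient'].transform('std').rename('score'))
print(list(group_std.index))  # ['strong', 'weak']
print(len(mapped))            # 5
```

Note that map keeps the name of the Series it was called on (score), which does not matter once the result is assigned to a new column.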
To compute the std over multiple columns with a for loop, just put the column names you want in the std_cols list.
std_cols = ['patient', 'obs']
for col in std_cols:
    df[f'std_dev_{col}'] = df.score.map(df.groupby(by='score')[col].std())
   patient  obs  treatment   score  std_dev_patient  std_dev_obs
0      242    1          0  strong        70.710678     0.707107
1      151    2          1    weak        20.663978     1.000000
2      111    3          0    weak        20.663978     1.000000
3      122    1          1    weak        20.663978     1.000000
4      342    2          0  strong        70.710678     0.707107
To make the OP's original for-loop solution work, do the following:
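std_dev_patient = []
df_g = df.groupby("score")  # group once, outside the loop, without overwriting df
for col in df.keys():
    if col == 'patient':
        # per-group std of this column, indexed by score
        std_dev_patient.append(df_g[col].std())
    else:
        pass
# look up each row's score in the per-group result
df['std_dev_patient'] = df.score.map(std_dev_patient[0])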
   patient  obs  treatment   score  std_dev_patient
0      242    1          0  strong        70.710678
1      151    2          1    weak        20.663978
2      111    3          0    weak        20.663978
3      122    1          1    weak        20.663978
4      342    2          0  strong        70.710678
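One thing worth keeping in mind when moving from np.std (as in the question) to the pandas std used in the answers: the defaults differ. pandas divides by n - 1 (ddof=1), while NumPy divides by n (ddof=0). A small sketch, using the two 'strong' patient values from the question; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'patient': [242, 151, 111, 122, 342],
    'score': ['strong', 'weak', 'weak', 'weak', 'strong'],
})

strong = df.loc[df.score == 'strong', 'patient']  # values 242 and 342

sample_std = strong.std()                   # pandas default, ddof=1
population_std = np.std(strong.to_numpy())  # NumPy default, ddof=0

print(round(sample_std, 6))      # 70.710678
print(round(population_std, 6))  # 50.0

# If the population figure is wanted per group, ddof can be passed explicitly.
pop_per_row = df.groupby('score')['patient'].transform(lambda s: s.std(ddof=0))
```

The 70.71 and 20.66 values in the expected output correspond to the pandas default, so the answers above need no change for that.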