python

Time: 2018-04-15 15:09:07

Tags: python pandas dataframe statistics summary

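I am trying to build a table with statistics (mean, variance, standard deviation, etc.) for A and B, given Y=1 and given Y=0. For example:

Given this DataFrame: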

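import numpy as np
import pandas as pd

# np.nan marks missing values (the np.NaN alias was removed in NumPy 2.0)
df = pd.DataFrame({'A': [0,    0.91, np.nan, 0.75,   np.nan, 1],
                   'B': [0.43, 1,    0.34,   np.nan, 0,      0.64],
                   'Y': [1,    0,    1,      1,      0,      1]
                      })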

I compute the statistics as follows:

for i in df:
    print(i)
    print("Mean Y1 " + " " + str(df[i][df["Y"]==1].mean()))
    print("Mean Y0 " + " " + str(df[i][df["Y"]==0].mean()))
    print("Var Y1 " + " " + str(np.var(df[i][df["Y"]==1])))
    print("Var Y0 " + " " + str(np.var(df[i][df["Y"]==0])))

However, printed this way they are hard to compare, so I am trying to create a table containing the statistics, like this:

stats = pd.DataFrame({'Column names': ['A', 'B', 'Y'],
                   'Mean Y1': [A_mean_given_Y==1, B_mean_given_Y==1, Z], 
                   'Mean Y0': [A_mean_given_Y==0, B_mean_given_Y==0, Z],
                   'Var Y1': [A_var_given_Y==1,   B_var_given_Y==1,  Z],
                   'Var Y0': [A_var_given_Y==0,   B_var_given_Y==0,  Z] 
                  })

# NOTE: Z is any number, as its value doesn't matter.

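However, df does not accept .append the way a list does, and converting a list of lists into a DataFrame after computing the statistics seems very inefficient. So, any ideas on how to create the statistics DataFrame with a loop?

2 answers:

Answer 0: (score: 1)

I think you first need DataFrameGroupBy.agg with a list of aggregate functions, then flatten the MultiIndex in the columns and, if necessary, reshape with stack and unstack: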

df1 = df.groupby('Y').agg(['mean','var'])
df1.columns = df1.columns.map('_'.join)
print (df1)
     A_mean     A_var  B_mean   B_var
Y                                    
0  0.910000       NaN    0.50  0.5000
1  0.583333  0.270833    0.47  0.0237
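
To make the flattening step easier to follow, here is a minimal sketch that inspects the columns before and after joining. It assumes the same df as in the question; the variable names agg, pairs and flat are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 0.91, np.nan, 0.75, np.nan, 1],
                   'B': [0.43, 1, 0.34, np.nan, 0, 0.64],
                   'Y': [1, 0, 1, 1, 0, 1]})

agg = df.groupby('Y').agg(['mean', 'var'])

# The columns are a two-level MultiIndex of (original column, statistic) pairs.
pairs = list(agg.columns)

# Joining each pair with '_' yields the flat names shown in the output above.
flat = ['_'.join(pair) for pair in pairs]

print(pairs)
print(flat)
```

Because every level here holds strings, '_'.join works directly; with non-string levels, each element would need to be converted with str first.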

Or:

df1 = df.groupby('Y').agg(['mean','var']).stack().sort_index(level=1)
df1.index = ['{}_{}'.format(j, i) for i, j in df1.index]
print (df1)
               A       B
mean_0  0.910000  0.5000
mean_1  0.583333  0.4700
var_0        NaN  0.5000
var_1   0.270833  0.0237

Or:

df1 = df.groupby('Y').agg(['mean','var']).stack(0).unstack(0)
df1.columns = ['{}_{}'.format(i,j) for i, j in df1.columns]
print (df1)
   mean_0    mean_1  var_0     var_1
A    0.91  0.583333    NaN  0.270833
B    0.50  0.470000    0.5  0.023700

For a Series as output:

s = df.groupby('Y').agg(['mean','var']).unstack()
s.index = ['{}_{}_{}'.format(i,j,k) for i, j,k in s.index]
print (s)
A_mean_0    0.910000
A_mean_1    0.583333
A_var_0          NaN
A_var_1     0.270833
B_mean_0    0.500000
B_mean_1    0.470000
B_var_0     0.500000
B_var_1     0.023700
dtype: float64
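
If the exact layout from the question is wanted (one row per variable, with columns named 'Mean Y1', 'Mean Y0', 'Var Y1', 'Var Y0'), one possible sketch is below. It swaps stack for a transpose followed by unstack, which is a different route from the snippets above, and the name wide is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 0.91, np.nan, 0.75, np.nan, 1],
                   'B': [0.43, 1, 0.34, np.nan, 0, 0.64],
                   'Y': [1, 0, 1, 1, 0, 1]})

# Transpose so the rows are (column, statistic) pairs and the columns are Y values,
# then move the statistic level into the columns: one row per variable.
wide = df.groupby('Y').agg(['mean', 'var']).T.unstack()

# Columns are now (Y value, statistic) pairs; build labels such as 'Mean Y1'.
wide.columns = ['{} Y{}'.format(stat.capitalize(), y) for y, stat in wide.columns]

# Reorder to match the layout sketched in the question.
wide = wide[['Mean Y1', 'Mean Y0', 'Var Y1', 'Var Y0']]
print(wide)
```

Unlike the table sketched in the question, this has no row for Y itself, because Y is the grouping key.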

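Answer 1: (score: 0)

I ended up doing it this way for flexibility (for example, you are not limited to the agg functions; you can add any function to the table simply by adding it inside the loop):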

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0,    0.91, np.nan, 0.75,   np.nan, 1],
                   'B': [0.43, 1,    0.34,   np.nan, 0,      0.64],
                   'Y': [1,    0,    1,      1,      0,      1]
                      })
stats = []
for i in df:
    new_row = [
        i,
        df[i][df["Y"]==1].mean(),
        df[i][df["Y"]==0].mean(),
        np.nanvar(df[i][df["Y"]==1]),
        np.nanvar(df[i][df["Y"]==0]),
    ]
    stats.append(new_row)

col_stats= ['Variable', 'Mean Y=1', 'Mean Y=0', 'Var Y=1', 'Var Y=0']
stats = pd.DataFrame(stats, columns=col_stats)
stats
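
Two things are worth keeping in mind when comparing this table with the groupby output. First, np.nanvar computes the population variance (ddof=0 by default), while the pandas 'var' aggregation uses the sample variance (ddof=1), so the variance columns will not match. Second, agg also accepts user-defined callables, so the groupby route is not restricted to built-in names. A minimal sketch of both points, where pop_var is a hypothetical helper introduced only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 0.91, np.nan, 0.75, np.nan, 1],
                   'B': [0.43, 1, 0.34, np.nan, 0, 0.64],
                   'Y': [1, 0, 1, 1, 0, 1]})

def pop_var(s):
    # Population variance (ddof=0); pandas skips NaN, so this matches np.nanvar.
    return s.var(ddof=0)

# A custom callable can sit next to the built-in names; its column is
# labelled with the function's name.
table = df.groupby('Y').agg(['mean', 'var', pop_var])

sample = table[('A', 'var')].loc[1]          # ddof=1, as in Answer 0
population = table[('A', 'pop_var')].loc[1]  # ddof=0, as in the loop above
print(sample, population)
```

For the group Y=0, column A has a single non-missing value, so the sample variance is NaN while the population variance is 0.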