Computing z-scores in a DataFrame while excluding N.A. values

Date: 2016-09-23 07:45:17

Tags: python pandas

I have a DataFrame that contains a bunch of N.A. values:

Data Dump

How do I get the z-score of each column while excluding the N.A. values? Would the z-score output then look like this?

Z-Score value output

This is what I have so far, based on a previous question:

cols = list(df.columns)
df[cols]
for col in cols:
    col_zscore = col + '_zscore'
    df[col_zscore] = (df[col] - df[col].mean())/df[col].std(ddof="N.A.")

But I get a TypeError.

Please help, I am a real beginner.

1 answer:

Answer 0: (score: 1)

I think you first need to replace N.A. with NaN and convert the values to float:

df = df.replace({'N.A.': np.nan}).astype(float)

for col in df.columns:
    if col != 'PE Trail':
        col_zscore = col + '_zscore'
        df[col_zscore] = (df[col] - df[col].mean())/df[col].std()

print (df)
   PE Trail  PE fwd   PB  PE fwd_zscore  PB_zscore
0       NaN    1.00  1.0       1.317465   0.707107
1       NaN    0.50  NaN       0.146385        NaN
2       NaN    0.00  0.5      -1.024695  -0.707107
3       NaN    0.25  NaN      -0.439155        NaN
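As a side note, `astype(float)` fails if the data contains any other non-numeric token besides N.A. A minimal sketch of an alternative, not part of the original answer, is to let `pd.to_numeric` with `errors="coerce"` turn everything unparseable into NaN. The sample frame below is an assumption that mirrors the data in the question as strings:

```python
import pandas as pd

# Assumed sample data: every value is still a string, as it would be
# after loading a file without telling pandas about "N.A.".
df = pd.DataFrame({
    "PE Trail": ["N.A.", "N.A.", "N.A.", "N.A."],
    "PE fwd": ["1", "0.5", "0", "0.25"],
    "PB": ["1", "N.A.", "0.5", "N.A."],
})

# Convert each column to numbers; anything that cannot be parsed becomes NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")
print(numeric)
```

The trade-off is that coercion hides typos in the data silently, so `replace` plus `astype(float)` is the stricter choice when N.A. is the only expected marker.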

In std, the ddof parameter must be an int, so the string "N.A." is not a valid value for it.
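To illustrate what ddof actually controls, here is a small sketch using the PE fwd values from above plus one extra NaN (the extra NaN is an assumption added for illustration). pandas skips NaN in `mean` and `std` by default, and `ddof` only changes the divisor of the variance (n - ddof):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 0.5, 0.0, 0.25, np.nan])

sample_std = s.std()            # default ddof=1: divide by n - 1, NaN skipped
population_std = s.std(ddof=0)  # ddof=0: divide by n, NaN skipped

# The NaN entry stays NaN in the result; the other rows are unaffected by it.
z_sample = (s - s.mean()) / sample_std
z_population = (s - s.mean()) / population_std

print(z_sample)
print(z_population)
```

With the default `ddof=1`, the first four values reproduce the PE fwd_zscore column shown in the answer's output.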

If you use read_csv, the na_values parameter converts N.A. to NaN while reading:

import pandas as pd
import numpy as np
import io

temp=u"""PE Trail;PE fwd;PB
N.A.;1;1
N.A.;0.5;N.A.
N.A.;0;0.5
N.A.;0.25;N.A."""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", na_values='N.A.')
print (df)
   PE Trail  PE fwd   PB
0       NaN    1.00  1.0
1       NaN    0.50  NaN
2       NaN    0.00  0.5
3       NaN    0.25  NaN
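Putting the two steps together, the following is a rough sketch of the whole pipeline: read the data with `na_values`, then compute all z-score columns at once instead of looping. The helper name `add_zscores` is hypothetical, and dropping all-NaN columns is an assumption that generalises the `col != 'PE Trail'` check from the loop above:

```python
import io
import pandas as pd

raw = u"""PE Trail;PE fwd;PB
N.A.;1;1
N.A.;0.5;N.A.
N.A.;0;0.5
N.A.;0.25;N.A."""

# "N.A." is parsed as NaN, so every column ends up numeric.
df = pd.read_csv(io.StringIO(raw), sep=";", na_values="N.A.")


def add_zscores(frame, ddof=1):
    """Hypothetical helper: append a '<col>_zscore' column per usable column."""
    # Columns that are entirely NaN have no mean or std, so skip them.
    usable = frame.dropna(axis=1, how="all")
    # mean() and std() skip NaN by default; the arithmetic aligns on column names.
    z = (usable - usable.mean()) / usable.std(ddof=ddof)
    return frame.join(z.add_suffix("_zscore"))


result = add_zscores(df)
print(result)
```

The result should have the same columns as the loop version: the three original columns followed by PE fwd_zscore and PB_zscore.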