Replacing missing and inconsistent values, Python

Time: 2016-03-09 14:55:38

Tags: python pandas machine-learning missing-data

Given the following example:

import pandas as pd
df = pd.DataFrame({'Column A': ['null', 20, 30, 40, 'null'], 'Column B': [100, 'null', 30, 50, 'null']})

The link for the example

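I need a Python function that takes two columns and compares them:

  1. If one column has a missing value, fill it from the other column.

  2. If both values are "blank", keep them "blank".

  3. If the values differ (are inconsistent), use 'NULL'.

  4. Replace both values.

  5. Return one attribute.

  6. After running the function, the data should look like this: the link for the result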

This is what I have done so far; I need help implementing step 3:

    def myFunction(firAttribute,secAttribute):
        x = df.loc[:, [firAttribute, secAttribute]].copy()
        x['new']=x[firAttribute].fillna(x[secAttribute])
        x['new2']=x[secAttribute].fillna(x[firAttribute])
        x['new'] =x['new'].fillna(x['new2'])
        return x['new'] 
    

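1 answer:

Answer 0: (score: 1)

You can first replace null with NaN, then use combine_first to fill the NaN values in each column from the other column, and finally use boolean indexing to set the rows where the two column values differ to NaN.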

import pandas as pd
import numpy as np

df = pd.DataFrame({'Column A': ['null', 20, 30, 40, 'null'],
                   'Column B': [100, 'null', 30, 50, 'null']})
print(df)
  Column A Column B
0     null      100
1       20     null
2       30       30
3       40       50
4     null     null

# replace the 'null' strings with NaN
df = df.replace("null", np.nan)
print(df)
   Column A  Column B
0       NaN       100
1        20       NaN
2        30        30
3        40        50
4       NaN       NaN

# fill the NaN values in each column from the other column
df['Column A'] = df['Column A'].combine_first(df['Column B'])
df['Column B'] = df['Column B'].combine_first(df['Column A'])
print(df)
   Column A  Column B
0       100       100
1        20        20
2        30        30
3        40        50
4       NaN       NaN

# set rows with inconsistent values to NaN
df[df['Column A'] != df['Column B']] = np.nan
print(df)
   Column A  Column B
0       100       100
1        20        20
2        30        30
3       NaN       NaN
4       NaN       NaN
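
The question asks for a function that takes two columns and returns a single attribute, so the steps above could be wrapped roughly as follows. This is only a sketch: the name merge_columns and the parameters missing and conflict are hypothetical, not part of pandas or of the original answer. It uses Series.mask instead of replace to turn the placeholder string into NaN, and it marks a row as inconsistent only when both values are present and differ.

```python
import numpy as np
import pandas as pd


def merge_columns(frame, first, second, missing="null", conflict=np.nan):
    """Combine two columns into one Series (hypothetical helper).

    `missing` is the placeholder string that stands for a missing value;
    `conflict` is what to store where the two columns disagree.
    """
    # Turn the placeholder string into a real missing value (NaN).
    a = frame[first].mask(frame[first] == missing)
    b = frame[second].mask(frame[second] == missing)

    # Steps 1 and 2: take the first column, fill its gaps from the second;
    # rows that are missing in both columns stay NaN.
    merged = a.combine_first(b)

    # Step 3: both values are present but differ, so mark the row.
    inconsistent = a.notna() & b.notna() & (a != b)
    merged[inconsistent] = conflict
    return merged


df = pd.DataFrame({'Column A': ['null', 20, 30, 40, 'null'],
                   'Column B': [100, 'null', 30, 50, 'null']})

result = merge_columns(df, 'Column A', 'Column B')
print(result)
```

Passing conflict='NULL' stores the literal string from step 3 of the question instead of NaN, which keeps "both blank" rows and "inconsistent" rows distinguishable in the result.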