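Python: dropping columns that contain a large share of missing values

Time: 2017-09-22 20:22:51

Tags: python pandas

I am trying to drop columns that contain more than a certain percentage of missing values. Here is a working example: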

import numpy as np
import pandas as pd

raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
    'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'],
    'age': [42, '', '', '', 73],  # empty strings, not NaN
    'sex': ['m', np.nan, 'f', 'm', 'f'],
    'preTestScore': [4, np.nan, np.nan, 2, 3],
    'postTestScore': [25, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age',
    'sex', 'preTestScore', 'postTestScore'])
df
 first_name last_name   age sex preTestScore    postTestScore
 0  Jason   Miller       42  m    4.0             25.0
 1  NaN     NaN              NaN  NaN             NaN
 2  Tina    Ali              f    NaN             NaN
 3  Jake    Milner           m    2.0             62.0
 4  Amy     Cooze       73   f    3.0             70.0

df = df.dropna(thresh=0.7*len(df), axis=1)
df
first_name  last_name   age sex
0   Jason   Miller      42  m
1   NaN     NaN             NaN
2   Tina    Ali             f
3   Jake    Milner          m
4   Amy     Cooze       73  f

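How can I drop the 'age' column as well? I have spent hours with dropna and with trying to put zeros into the empty cells. I cannot figure out how to detect the missing cells in the 'age' column.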

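3 answers:

Answer 0 (score: 4)

You need replace first, then dropna. The 'age' column holds empty strings, which dropna does not count as missing until they are converted to NaN.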

df=df.replace({'':np.nan})
df = df.dropna(thresh=0.7*len(df), axis=1)
df
Out[858]: 
  first_name last_name  sex
0      Jason    Miller    m
1        NaN       NaN  NaN
2       Tina       Ali    f
3       Jake    Milner    m
4        Amy     Cooze    f
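
To see why the replace step matters, here is a minimal sketch on a cut-down frame. The whitespace-only entry `' '` is not in the question's data; it is added here as an assumption to show that a regex replace also catches cells that only look empty.

```python
import numpy as np
import pandas as pd

# Cut-down, illustrative version of the question's frame.
# The ' ' entry is hypothetical and not part of the original data.
df = pd.DataFrame({
    "first_name": ["Jason", np.nan, "Tina", "Jake", "Amy"],
    "age": [42, "", " ", "", 73],
})

# Empty strings are ordinary values, so isna() does not flag them.
nan_before = df["age"].isna().sum()

# One way to detect them: compare the stripped string form with ''.
empty_mask = df["age"].astype(str).str.strip().eq("")

# Replace empty and whitespace-only strings with NaN in every column.
cleaned = df.replace(r"^\s*$", np.nan, regex=True)
nan_after = cleaned["age"].isna().sum()

print(nan_before, int(empty_mask.sum()), nan_after)
```

After this step the 'age' column has real NaN values, so a threshold passed to dropna sees it the same way as the other columns.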

Answer 1 (score: 1)

First replace '' (empty strings) with NaN, then use dropna()

df = df.replace({'':np.nan})
df

      first_name last_name   age  sex  preTestScore  postTestScore
0      Jason    Miller  42.0    m           4.0           25.0
1        NaN       NaN   NaN  NaN           NaN            NaN
2       Tina       Ali   NaN    f           NaN            NaN
3       Jake    Milner   NaN    m           2.0           62.0
4        Amy     Cooze  73.0    f           3.0           70.0

You can check the percentage of missing values per column with the following function

def missing(dff):
    print("Missing values in %")
    print(round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))

missing(df)

Missing values in %
age              60.0
postTestScore    40.0
preTestScore     40.0
sex              20.0
last_name        20.0
first_name       20.0
dtype: float64

Say you want to drop every column in which 60% or more of the values are missing

df = df.drop(df.loc[:,list((100*(df.isnull().sum()/len(df.index))>=60))].columns, axis=1)

  first_name last_name  sex  preTestScore  postTestScore
0      Jason    Miller    m           4.0           25.0
1        NaN       NaN  NaN           NaN            NaN
2       Tina       Ali    f           NaN            NaN
3       Jake    Milner    m           2.0           62.0
4        Amy     Cooze    f           3.0           70.0

Note: the 'age' column (60% of its values missing) has been dropped.
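
The same selection can also be written as a boolean mask over the columns, which some readers find easier to follow. This is only a sketch on a cut-down frame in which the empty strings are assumed to have been replaced with NaN already; `missing_share` and `kept` are names made up for the illustration.

```python
import numpy as np
import pandas as pd

# Cut-down frame; empty strings are assumed to be NaN already.
df = pd.DataFrame({
    "first_name": ["Jason", np.nan, "Tina", "Jake", "Amy"],
    "age": [42, np.nan, np.nan, np.nan, 73],
    "preTestScore": [4, np.nan, np.nan, 2, 3],
})

# Fraction of missing values per column (0.0 to 1.0).
missing_share = df.isna().mean()

# Keep only the columns whose missing share is below 60%.
kept = df.loc[:, missing_share < 0.6]

print(list(kept.columns))
```

Here 'age' has a missing share of 3/5, which is not below 0.6, so it is left out of `kept`.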

Answer 2 (score: 0)

How about using dropna from pandas:

def drop_columns(df, threshold):
    # threshold: largest share of missing values a column may have (0 to 1)
    return df.dropna(axis=1, thresh=len(df) * (1 - threshold))
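
For reference, a self-contained variant with a usage example might look like the sketch below. The name `drop_sparse_columns` and the parameter `max_missing` are hypothetical, and rounding the threshold up to a whole number with `math.ceil` is an assumption made here so that `thresh` is always an integer count.

```python
import math

import numpy as np
import pandas as pd


def drop_sparse_columns(frame, max_missing):
    """Drop columns whose share of missing values exceeds max_missing (0 to 1)."""
    # dropna's thresh is the minimum number of non-missing values to keep a column.
    min_non_missing = math.ceil(len(frame) * (1 - max_missing))
    return frame.dropna(axis=1, thresh=min_non_missing)


# Cut-down frame; empty strings are assumed to be NaN already.
df = pd.DataFrame({
    "first_name": ["Jason", np.nan, "Tina", "Jake", "Amy"],
    "age": [42, np.nan, np.nan, np.nan, 73],
    "preTestScore": [4, np.nan, np.nan, 2, 3],
})

result = drop_sparse_columns(df, max_missing=0.5)
print(list(result.columns))
```

With five rows and `max_missing=0.5`, a column needs at least three non-missing values to survive, so 'age' (two values) is dropped. The function returns a new frame and leaves the input unchanged.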

(This is my first answer, so apologies if I have not followed the etiquette.)