How to drop columns with at least 20% missing values

Time: 2016-06-20 11:44:47

Tags: python pandas machine-learning

Is there an efficient way to drop the columns that have at least 20% missing values?

Suppose my dataframe looks like this:

   A      B      C      D
0  sg     hh     1      7
1  gf                   9
2  hh                   10
3  dd                   8
4                       6
5  y                    8

After dropping the columns, the dataframe would look like this:

   A       D
0  sg      7
1  gf      9
2  hh      10
3  dd      8
4          6
5  y       8

2 answers:

Answer 0 (score: 10)

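You can use boolean indexing on the columns, keeping those whose count of notnull values is greater than 80% of the number of rows:

df.loc[:, pd.notnull(df).sum()>len(df)*.8]

This pattern is useful in many cases. For example, to keep only the columns in which more than 80% of the values are larger than 1:

df.loc[:, (df > 1).sum() > len(df) * .8]

Alternatively, for the missing-value case, you can also specify the thresh keyword of .dropna(), as illustrated by @EdChum:

df.dropna(thresh=0.8*len(df), axis=1)

The latter is slightly faster:

import numpy as np
import pandas as pd

# 100 x 5 frame of random floats; each column gets NaN at 10 to 29
# randomly drawn row labels (drawn with replacement)
df = pd.DataFrame(np.random.random((100, 5)), columns=list('ABCDE'))
for col in df:
    df.loc[np.random.choice(list(range(100)), np.random.randint(10, 30)), col] = np.nan

%timeit df.loc[:, pd.notnull(df).sum()>len(df)*.8]
1000 loops, best of 3: 716 µs per loop

%timeit df.dropna(thresh=0.8*len(df), axis=1)
1000 loops, best of 3: 537 µs per loop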

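The same filter can also be phrased in terms of the fraction of missing values per column, which reads closer to the wording of the question. The following is a minimal sketch, not part of the original answer: the helper name `drop_sparse_columns`, its `max_missing` parameter, and the sample frame are all hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.2) -> pd.DataFrame:
    """Keep only the columns whose share of missing values is below max_missing.

    Hypothetical helper for illustration; the name and default are assumptions.
    """
    missing_fraction = df.isna().mean()  # per-column fraction of NaN values
    return df.loc[:, missing_fraction < max_missing]


# Made-up sample data with 10 rows per column
frame = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, np.nan, 6.0, 7.0, 8.0, 9.0, 10.0],     # 10% missing
    "B": [1.0, np.nan, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],  # 20% missing
    "C": [np.nan] * 9 + [1.0],                                       # 90% missing
})

result = drop_sparse_columns(frame)
print(list(result.columns))
```

Because the comparison is strict, a column with exactly 20% missing values (B here) is dropped, which matches "at least 20% missing" in the question.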

Answer 1 (score: 3)

You can call dropna and pass a thresh value to drop the columns that do not meet your threshold criterion:

In [10]:    
frac = len(df) * 0.8
df.dropna(thresh=frac, axis=1)

Out[10]:
     A   D
0   sg   7
1   gf   9
2   hh  10
3   dd   8
4  NaN   6
5    y   8
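For anyone who wants to reproduce this, here is a small self-contained sketch that rebuilds the frame from the question and applies the threshold. It assumes the blank cells in the question are NaN (the output above suggests so, but the question does not say), and it rounds the threshold up to an integer with `math.ceil`, which is a choice made here rather than something taken from the answer.

```python
import math

import numpy as np
import pandas as pd

# Frame from the question, assuming blank cells are NaN
df = pd.DataFrame({
    "A": ["sg", "gf", "hh", "dd", np.nan, "y"],
    "B": ["hh", np.nan, np.nan, np.nan, np.nan, np.nan],
    "C": [1, np.nan, np.nan, np.nan, np.nan, np.nan],
    "D": [7, 9, 10, 8, 6, 8],
})

# Minimum number of non-null values a column needs in order to be kept
min_non_null = math.ceil(len(df) * 0.8)

by_dropna = df.dropna(thresh=min_non_null, axis=1)
by_mask = df.loc[:, df.notna().sum() >= min_non_null]

print(list(by_dropna.columns))
print(list(by_mask.columns))
```

Note that thresh is the number of non-NA values a column must have to be kept, so a column sitting exactly on the threshold survives dropna, whereas a strict `>` comparison in a boolean mask would drop it. If a column with exactly 20% missing values must go, the strict form is the safer choice.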