Question

I have the following dataset in a .csv file:

feature1, feature2, feature3, feature4
0, 42, 2, 1000
2, 13, ?, 997
1, 30, ?, 861
2, 29, ?, ?

I want to create a pandas DataFrame or a NumPy array in which features with x% unknown data are dropped (where x was specified earlier in the code).

Answer 1

Use replace and dropna (PS: you need to use the thresh parameter in dropna).

Data input

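import pandas as pd
import numpy as np

# Turn '?' into NaN, then keep only the columns with at least 75% non-NA values.
# For your example, that means we accept at most one NA per column.
df.replace('?', np.nan).dropna(axis=1, thresh=0.75 * len(df))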

Out[735]: 
   feature1  feature2  feature4
0         0         1     100.0
1         2         2     900.0
2         1         3     861.0
3         2         4       NaN
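
To tie this back to the x in the question, here is a minimal, self-contained sketch. It assumes x is the largest fraction of unknown values you tolerate per column (0.25 is just my pick), and it loads the question's data from an inline string. CSV_TEXT and min_known are names I made up for illustration. Since thresh counts the non-missing values a column needs, the fraction has to be converted first.

```python
import io
import math

import pandas as pd

# Sample data copied from the question; '?' marks an unknown value.
CSV_TEXT = """feature1, feature2, feature3, feature4
0, 42, 2, 1000
2, 13, ?, 997
1, 30, ?, 861
2, 29, ?, ?
"""

# Assumption: x is the largest tolerated fraction of unknown values per column.
x = 0.25

# skipinitialspace drops the blank after each comma, so '?' matches na_values.
df = pd.read_csv(io.StringIO(CSV_TEXT), skipinitialspace=True, na_values="?")

# thresh is the minimum number of non-missing values a column must have.
min_known = math.ceil((1 - x) * len(df))
result = df.dropna(axis=1, thresh=min_known)
print(result)
```

With four rows and x = 0.25, min_known works out to 3, so feature3 (one known value) is dropped while feature4 (three known values) stays.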

Answer 2

I'm going to assume those '?' are null values. If they aren't, do the following:

df = df.apply(pd.to_numeric, errors='coerce')

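Now we can create a function that takes a DataFrame and a threshold. What we want to do is use loc with a boolean Series that tells us which columns have sufficient data representation.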

drp = lambda d, x: d.loc[:, d.isnull().mean() < x]

drp(df, .5)

   feature1  feature2  feature4
0         0        42    1000.0
1         2        13     997.0
2         1        30     861.0
3         2        29       NaN
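
If it helps to see what the lambda is doing, here is a small sketch that spells out the intermediate boolean Series. The DataFrame is rebuilt by hand from the question's data, with '?' already treated as NaN; null_share and keep are just illustrative names.

```python
import numpy as np
import pandas as pd

# The question's data, with '?' already converted to NaN.
df = pd.DataFrame({
    "feature1": [0, 2, 1, 2],
    "feature2": [42, 13, 30, 29],
    "feature3": [2, np.nan, np.nan, np.nan],
    "feature4": [1000, 997, 861, np.nan],
})

# isnull() gives True/False per cell; mean() turns that into the
# fraction of missing values in each column.
null_share = df.isnull().mean()

# Boolean Series indexed by column name: True means "keep this column".
keep = null_share < 0.5

print(null_share)
print(df.loc[:, keep])
```

Here feature3 has a missing share of 0.75 and is dropped, while feature4 sits at 0.25 and is kept.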

If you insist on keeping the '?' as they are in df, we can mask them as NaN in a separate copy instead:

d = df.mask(df.astype(object).eq('?'))

drp = lambda d, x: d.loc[:, d.isnull().mean() < x]

drp(d, .5)
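
As a quick check of the mask approach, the sketch below uses a cut-down, hand-built version of the data (raw, masked and kept are hypothetical names). It is meant to show that the original frame keeps its '?' cells while the masked copy is what gets filtered.

```python
import pandas as pd

# A reduced version of the question's data where '?' is still a string.
raw = pd.DataFrame({
    "feature1": [0, 2, 1, 2],
    "feature3": ["2", "?", "?", "?"],
    "feature4": ["1000", "997", "861", "?"],
})

# mask() returns a new frame with NaN wherever the condition is True;
# raw itself is left untouched.
masked = raw.mask(raw.astype(object).eq("?"))

# Same column filter as drp: keep columns with under 50% missing values.
kept = masked.loc[:, masked.isnull().mean() < 0.5]
print(kept)
```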

Answer 3

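If I understand your question correctly, this is probably the simplest solution. You can replace '?' with NaN using np.nan, then use df.loc and df.isnull to select the columns you want.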

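# Raw string so that \? is a literal question mark in the regex
df.replace(to_replace=r'\?', value=np.nan, inplace=True, regex=True)
# Keep columns where at most a quarter of the rows are missing
df = df.loc[:, df.isnull().sum() <= len(df) / 4]
print(df)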
   feature1  feature2  feature4
0         0        42      1000
1         2        13       997
2         1        30       861
3         2        29       NaN
