Question

I have a Pandas DataFrame in which some values are missing (denoted by ?). Is there an easy way to delete all rows in which at least one column has the value ?

Normally I would use boolean indexing, but I have a lot of columns. One way is the following:

for index, row in df.iterrows():
    for col in df.columns:
        if row[col] == '?':
            df = df.drop(index)  # delete row
            break  # this row is gone, move on to the next one

But this seems unpythonic...

Any ideas?

Answer 1

Option 1a
boolean indexing and any

df 
     col1  col2 col3 col4
row1   65    24   47    ?
row2   33    48    ?   89
row3    ?    34   67    ?
row4   24    12   52   17

(df.astype(str) == '?').any(axis=1)
row1     True
row2     True
row3     True
row4    False
dtype: bool

df = df[~(df.astype(str) == '?').any(axis=1)]
df
     col1  col2 col3 col4
row4   24    12   52   17

Here, the astype(str) cast guards against TypeError: Could not compare ['?'] with block values, which is raised when your DataFrame mixes string and numeric columns.
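
As a self-contained sketch of Option 1a, the snippet below rebuilds a frame like the one shown above and applies the mask. The variable names has_qmark and cleaned are illustrative only, and axis=1 is passed by keyword on the assumption that a recent pandas release is in use.

```python
import pandas as pd

# Sample frame mirroring the one above: integers mixed with the string '?'.
df = pd.DataFrame(
    {
        "col1": [65, 33, "?", 24],
        "col2": [24, 48, 34, 12],
        "col3": [47, "?", 67, 52],
        "col4": ["?", 89, "?", 17],
    },
    index=["row1", "row2", "row3", "row4"],
)

# Cast every cell to str so the comparison is safe on mixed columns,
# then flag each row that has at least one '?' cell.
has_qmark = (df.astype(str) == "?").any(axis=1)

# Keep only the rows that were not flagged.
cleaned = df[~has_qmark]
print(cleaned)
```

Only row4 should survive, since it is the only row without a ? cell.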

Option 1b, with values

Direct comparison

(df.values == '?').any(1)
array([ True,  True,  True, False], dtype=bool)

df = df[~(df.values == '?').any(1)]
df
     col1  col2 col3 col4
row4   24    12   52   17
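
A minimal runnable sketch of the same idea, using to_numpy() in place of the .values attribute (both return the underlying NumPy array). It assumes the frame mixes types, so that the array has object dtype and the comparison with '?' is evaluated element by element; the names values, mask and cleaned are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [65, 33, "?", 24],
        "col2": [24, 48, 34, 12],
        "col3": [47, "?", 67, 52],
        "col4": ["?", 89, "?", 17],
    },
    index=["row1", "row2", "row3", "row4"],
)

# Mixed int/str columns collapse to a single object-dtype array.
values = df.to_numpy()

# Element-wise comparison, then reduce across the columns of each row.
mask = (values == "?").any(axis=1)

# A boolean NumPy array can index the rows of the DataFrame directly.
cleaned = df[~mask]
print(cleaned)
```

This skips the string cast, but it relies on the object dtype: on a frame whose columns are all numeric the comparison may not be element-wise, depending on the NumPy version.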

Option 2
df.replace and df.notnull

df.replace('?', np.nan).notnull().all(axis=1)
row1    False
row2    False
row3    False
row4     True
dtype: bool

df = df[df.replace('?', np.nan).notnull().all(axis=1)]
df
     col1  col2 col3 col4
row4   24    12   52   17

This avoids the astype(str) call. Alternatively, you could do as Wen suggested and simply drop them:

df.replace('?', np.nan).dropna()
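
The replace-then-drop route can be sketched end to end as follows. The final astype(int) is an optional extra borrowed from the approach in Answer 3, on the assumption that the remaining cells are all integers; the intermediate dtypes after replace can differ between pandas versions, so the sketch does not rely on them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [65, 33, "?", 24],
        "col2": [24, 48, 34, 12],
        "col3": [47, "?", 67, 52],
        "col4": ["?", 89, "?", 17],
    },
    index=["row1", "row2", "row3", "row4"],
)

# Turn the '?' placeholders into real missing values ...
with_nan = df.replace("?", np.nan)

# ... drop every row that contains at least one of them ...
cleaned = with_nan.dropna()

# ... and, optionally, restore an integer dtype now that no NaN is left.
cleaned = cleaned.astype(int)
print(cleaned)
```

Converting to NaN first also makes the rest of the pandas missing-data tooling, such as isna or fillna, available if dropping rows turns out to be too aggressive.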

Answer 2

Or just replace with NaN and use dropna

df.replace({'?':np.nan}).dropna()
Out[126]: 
     col1  col2 col3 col4
row4   24    12   52   17

Answer 3

You can use boolean indexing with all to check that the values do not contain ?

If the types are mixed (integers together with the string ?):

df = pd.DataFrame({'B':[4,5,'?',5,5,4],
                   'C':[7,'?',9,4,2,3],
                   'D':[1,3,5,7,'?',0],
                   'E':[5,3,'?',9,2,4]})

print (df)
   B  C  D  E
0  4  7  1  5
1  5  ?  3  3
2  ?  9  5  ?
3  5  4  7  9
4  5  2  ?  2
5  4  3  0  4

df = df[(df.astype(str) != '?').all(axis=1)].astype(int)
print (df)
   B  C  D  E
0  4  7  1  5
3  5  4  7  9
5  4  3  0  4
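
The all-based mask used here is the logical complement of the any-based mask from Answer 1: "no cell equals ?" is the same condition as "not (some cell equals ?)". The short check below illustrates this on the same data; keep_all and keep_any are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'B': [4, 5, '?', 5, 5, 4],
                   'C': [7, '?', 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, '?', 0],
                   'E': [5, 3, '?', 9, 2, 4]})

as_str = df.astype(str)

# Rows where every cell differs from '?'.
keep_all = (as_str != '?').all(axis=1)

# Rows where it is not the case that some cell equals '?'.
keep_any = ~(as_str == '?').any(axis=1)

# Both masks select exactly the same rows.
print(keep_all.equals(keep_any))

cleaned = df[keep_all].astype(int)
print(cleaned)
```

Which form to use is a matter of taste; the all version reads as "keep the clean rows", the any version as "drop the dirty ones".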

Or compare against the numpy array created by values:

df = df[(df.values != '?').all(axis=1)]
print (df)
   B  C  D  E
0  4  7  1  5
3  5  4  7  9
5  4  3  0  4

If all values are strings, the solution can be simplified:

df = pd.DataFrame({'B':[4,5,'?',5,5,4],
                   'C':[7,'?',9,4,2,3],
                   'D':[1,3,5,7,'?',0],
                   'E':[5,3,'?',9,2,4]}).astype(str)


df = df[(df != '?').all(axis=1)].astype(int)
print (df)
   B  C  D  E
0  4  7  1  5
3  5  4  7  9
5  4  3  0  4
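
If the filter is needed in several places, it could be wrapped in a small helper. This is only a sketch: drop_rows_containing and its marker parameter are hypothetical names, not part of pandas, and the string cast assumes that comparing the text form of each cell is acceptable for your data.

```python
import pandas as pd


def drop_rows_containing(df: pd.DataFrame, marker: str = "?") -> pd.DataFrame:
    """Return a copy of df without the rows where any cell equals marker.

    Hypothetical helper, not a pandas API.
    """
    # Cast to str so mixed numeric/string columns compare without errors.
    keep = (df.astype(str) != marker).all(axis=1)
    return df[keep].copy()


df = pd.DataFrame({'B': [4, 5, '?', 5, 5, 4],
                   'C': [7, '?', 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, '?', 0],
                   'E': [5, 3, '?', 9, 2, 4]})

cleaned = drop_rows_containing(df)
print(cleaned)
```

The helper leaves the input frame untouched and returns a new one, so the original data stays available for inspection.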

Delete all rows from a DataFrame that contain a question mark (?)
