Question

I have a DataFrame of imported data. However, the imported data may be incorrect, so I am trying to get rid of the bad entries. Example DataFrame:

    user    test1    test2    other
0   foo       1        7       bar
1   foo       2        9       bar
2   foo       3;as     5       bar
3   foo       3        5       bar

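I want to clean the columns test1 and test2. I want to get rid of values that are not in a specified set and of entries that contain strings by mistake (like the entry 3;as above). I do this by defining a dictionary of acceptable values: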

values_dict = {
    'test1' : [1,2,3],
    'test2' : [5,6,7],
}

and a list of the column names I want to clean:

headers = ['test1', 'test2']

My current code:

# Remove string entries
for i in headers:
    df[i] = pd.to_numeric(df[i], errors='coerce')
    df[i] = df[i].fillna(0).astype(int)

# Remove unwanted values
for i in values_dict:
    df[i] = df[df[i].isin(values_dict[i])]

But this does not seem to remove the rows with bad values to produce the desired DataFrame:

    user    test1    test2    other
0   foo       1        7       bar
1   foo       3        5       bar

Thanks for your help!

Answer 1

You can do it like this: use np.logical_and to build an AND condition across multiple columns, then use it to subset the DataFrame:

import numpy as np  # the pd.np alias was removed in pandas 2.0
headers = ['test1', 'test2']
df[np.logical_and(*(pd.to_numeric(df[col], errors='coerce').isin(values_dict[col]) for col in headers))]

#  user  test1  test2   other
#0  foo      1      7     bar
#3  foo      3      5     bar
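For anyone who wants to reproduce this, here is a minimal self-contained sketch. It rebuilds the sample data from the question by hand; storing test1 as strings is an assumption about how the import ended up, and the names masks, keep and cleaned are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question. test1 is assumed to hold strings
# because of the bad "3;as" entry.
df = pd.DataFrame({
    "user": ["foo"] * 4,
    "test1": ["1", "2", "3;as", "3"],
    "test2": [7, 9, 5, 5],
    "other": ["bar"] * 4,
})
values_dict = {"test1": [1, 2, 3], "test2": [5, 6, 7]}
headers = ["test1", "test2"]

# One boolean Series per column: non-numeric entries become NaN and fail isin().
masks = [
    pd.to_numeric(df[col], errors="coerce").isin(values_dict[col])
    for col in headers
]

# Element-wise AND of the two masks, then keep only rows where both hold.
keep = np.logical_and(*masks)
cleaned = df[keep]
print(cleaned)
```

Only the mask is built from the converted values, so the filtered frame keeps the original column dtypes. If numeric columns are wanted afterwards, pd.to_numeric can be applied to the filtered result.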

Breakdown:

[pd.to_numeric(df[col], errors='coerce').isin(values_dict[col]) for col in headers]

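This first converts each column of interest to a numeric type, then checks whether its values are in the accepted list; that produces one boolean Series per column: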

#[0     True
# 1     True
# 2    False
# 3     True
# Name: test1, dtype: bool, 
# 0     True
# 1    False
# 2     True
# 3     True
# Name: test2, dtype: bool]

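To satisfy the conditions for all columns at once, we need an AND operation, which can be built with numpy.logical_and; the * here unpacks the per-column conditions as arguments. Note that numpy.logical_and takes exactly two input arrays, so this form works as written only when headers lists two columns.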
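If more than two columns need checking, one possible variant is np.logical_and.reduce, which folds the AND over a whole list of masks. The sketch below is not part of the original answer; filter_valid_rows is a hypothetical helper name, and the test3 column is invented purely to show a third condition.

```python
import numpy as np
import pandas as pd


def filter_valid_rows(df, values_dict):
    """Hypothetical helper: keep rows whose listed columns all hold accepted values."""
    masks = [
        pd.to_numeric(df[col], errors="coerce").isin(allowed)
        for col, allowed in values_dict.items()
    ]
    # reduce() folds logical_and over the list, so any number of masks works.
    keep = np.logical_and.reduce(masks)
    return df[keep]


df = pd.DataFrame({
    "test1": ["1", "2", "3;as", "3"],
    "test2": [7, 9, 5, 5],
    "test3": [0, 0, 0, 1],  # extra column, invented for illustration
})
result = filter_valid_rows(
    df, {"test1": [1, 2, 3], "test2": [5, 6, 7], "test3": [0]}
)
print(result)
```

Iterating over the dictionary itself means the column list and the accepted values cannot drift apart, which is a design choice rather than a requirement.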
