Question

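I have a DataFrame. Some of its columns should contain only 0s and 1s. I need to find the rows where those columns hold a value other than 0 or 1, and then drop those entire rows from the original dataset.

I created a second DataFrame containing the columns that must be checked. After finding the indices and dropping them from the original DataFrame, I don't get the right answer.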

import pandas as pd

# Read in the data:
data = pd.read_csv('DataSet.csv')

# Create a subset df of the columns that must be only 0 or 1
# (all rows, from column position 2 onwards):
subset = data.iloc[:, 2:]

# Find indices:
index = subset[(subset != 0) & (subset != 1)].index

# Remove rows from the original data set:
data = data.drop(index)

It gives me an empty index array. Please help.

Answer 1

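Your subset is a pd.DataFrame, not a pd.Series. The conditional test you apply to subset would work if subset were a Series (that is, if you were checking the condition on a single column rather than several).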

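Using subset as a DataFrame is fine, but it changes how conditional selection works. In my tests, your index variable comes back with NaN in place of the 0s and 1s (rather than excluding them, as a Series would). Adding dropna(how='all') as shown below should fix your code. A plain dropna() is not enough, because it keeps only the rows in which every checked column is invalid:

# Find indices: keep the rows that still hold at least one non-NaN
# (i.e. invalid) value after masking:
index = subset[(subset != 0) & (subset != 1)].dropna(how='all').index

# Remove rows from the original data set:
data = data.drop(index)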
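
To make the Series/DataFrame difference concrete, here is a minimal sketch on made-up data (the column names D and E and their values are assumptions, not the asker's file). It compares boolean indexing on a single column with the same test on the whole frame, and then the two dropna variants:

```python
import pandas as pd

# Made-up data: column "D" is always valid, column "E" has two bad values.
subset = pd.DataFrame({"D": [1, 0, 1, 0, 1, 0],
                       "E": [1, 0, 0, 1, 2, 4]})

mask = (subset != 0) & (subset != 1)

# On a Series, a boolean mask filters: only the offending rows remain.
series_hits = subset["E"][mask["E"]].index.tolist()

# On a DataFrame, the same indexing keeps every row and writes NaN
# wherever the mask is False.
masked = subset[mask]

# Default dropna() (how='any') drops a row if ANY column is NaN, so a row
# survives only when every checked column is invalid.
any_hits = masked.dropna().index.tolist()

# dropna(how='all') drops a row only if ALL columns are NaN, so a row
# survives when at least one checked column is invalid.
all_hits = masked.dropna(how="all").index.tolist()

print(series_hits, any_hits, all_hits)
```

Note that this approach relies on NaN as the "valid" marker, so it would presumably need extra care if the checked columns can already contain missing values.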

Answer 2

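Without the data from DataSet.csv, I can only guess.

subset[(subset != 0) & (subset != 1)] returns the subset DataFrame with the values where (subset != 0) & (subset != 1) is False turned into NaN, while the values where it is True are kept unchanged. In other words, it acts as an element-wise map, not a filter.

So subset[(subset != 0) & (subset != 1)].index is the entire index of the data DataFrame.

You then drop all of it, which leaves an empty DataFrame.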
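
The behaviour described above can be reproduced with a small sketch. The frame below is invented for illustration (the id and flag columns are assumptions). It checks that the masked frame keeps the full index, that the indexing matches DataFrame.where, and that dropping that index removes every row:

```python
import pandas as pd

# Invented data: one identifier column and one column that should be 0/1.
data = pd.DataFrame({"id": list("abc"), "flag": [0, 1, 7]})
subset = data[["flag"]]

condition = (subset != 0) & (subset != 1)

# Indexing a DataFrame with a boolean DataFrame masks values; it does
# not remove rows.
masked = subset[condition]

# The masked frame still carries the complete original index.
same_index = masked.index.equals(data.index)

# The same result can be obtained explicitly with DataFrame.where.
same_as_where = masked.equals(subset.where(condition))

# Dropping that index therefore removes every row.
remaining = data.drop(masked.index)

print(same_index, same_as_where, len(remaining))
```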

Answer 3

Example:

import pandas as pd

data = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'D': [1, 0, 1, 0, 1, 0],
    'E': [1, 0, 0, 1, 2, 4],
})

print(data)
   A  B  D  E
0  a  4  1  1
1  b  5  0  0
2  c  4  1  0
3  d  5  0  1
4  e  5  1  2
5  f  4  0  4

If only the values 1 and 0 are allowed, use DataFrame.isin together with DataFrame.all to test whether every value in a row is True:

subset = data.iloc[:,2:]
data3 = data[subset.isin([0,1]).all(axis=1)]
print (data3)

   A  B  D  E
0  a  4  1  1
1  b  5  0  0
2  c  4  1  0
3  d  5  0  1

Details:

print (subset.isin([0,1]))
      D      E
0  True   True
1  True   True
2  True   True
3  True   True
4  True  False
5  True  False

print (subset.isin([0,1]).all(axis=1))
0     True
1     True
2     True
3     True
4    False
5    False
dtype: bool
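
If you would rather keep the question's "find the index, then drop" workflow, the same row test can be inverted to collect the offending labels. This is a sketch on the example frame from this answer; the names valid, bad_index, and cleaned are chosen here for illustration:

```python
import pandas as pd

data = pd.DataFrame({
    "A": list("abcdef"),
    "B": [4, 5, 4, 5, 5, 4],
    "D": [1, 0, 1, 0, 1, 0],
    "E": [1, 0, 0, 1, 2, 4],
})

subset = data.iloc[:, 2:]

# True for rows in which every checked column is 0 or 1.
valid = subset.isin([0, 1]).all(axis=1)

# Index labels of the rows that contain at least one other value.
bad_index = data.index[~valid]

# Dropping those labels gives the same result as data[valid].
cleaned = data.drop(bad_index)

print(bad_index.tolist())
print(cleaned)
```

Filtering with data[valid] is shorter, while the explicit bad_index can be handy if you also want to log or inspect the rejected rows.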

Answer 4

Judging by your code, my educated guess is that you want to run the comparison over more than one column.

This should do the trick:

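import numpy as np

# Flag the elements that are 0 or 1 (boolean array, same shape as subset)
val = np.isin(subset, np.array([0, 1]))

# Generate a boolean row mask: the product is nonzero only if every
# element in the row is True
index = np.prod(val, axis=1) > 0

# Select only the desired rows
data = data[index]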

Example

# Data
   a  b  c
0  1  1  1
1  2  2  2
2  3  1  3
3  4  3  3
4  5  3  1

# After removing rows that have elements other than 1 or 2
# (i.e. with np.array([1, 2]) as the allowed values)
   a  b  c
0  1  1  1
1  2  2  2

Scanning a subset of a pandas DataFrame to get the indices that match certain values
