Question

I have a Pandas DataFrame in which all rows in a given column must match:

import pandas as pd

df = pd.DataFrame({'A': [1,1,1,1,1,1,1,1,1,1],
                   'B': [2,2,2,2,2,2,2,2,2,2],
                   'C': [3,3,3,3,3,3,3,3,3,3],
                   'D': [4,4,4,4,4,4,4,4,4,4],
                   'E': [5,5,5,5,5,5,5,5,5,5]})

In [10]: df
Out[10]:
   A  B  C  D  E
0  1  2  3  4  5
1  1  2  3  4  5
2  1  2  3  4  5
...
6  1  2  3  4  5
7  1  2  3  4  5
8  1  2  3  4  5
9  1  2  3  4  5

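I want a quick way to tell whether there are any differences in the DataFrame. At this point I don't need to know which values have changed, because I will deal with them later. I only need to know whether the DataFrame needs further attention, or whether I can ignore it and move on to the next one.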

I can check any given column using

(df.loc[:,'A'] != df.loc[0,'A']).any()

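But my Pandas knowledge limits me to iterating over the columns (I understand that iteration is frowned upon in Pandas) to compare all of them: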

   A  B  C  D  E
0  1  2  3  4  5
1  1  2  9  4  5
2  1  2  3  4  5
...
6  1  2  3  4  5
7  1  2  3  4  5
8  1  2  3  4  5
9  1  2  3  4  5

for col in df.columns:
    if (df.loc[:,col] != df.loc[0,col]).any():
        print("Found a fail in col %s" % col)
        break

Out: Found a fail in col C

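Is there an elegant way, possibly without iteration, to return a boolean indicating whether any row in any column of the DataFrame fails to match the rest of the values in that column?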

Answer 1

Given your example DataFrame:

df = pd.DataFrame({'A': [1,1,1,1,1,1,1,1,1,1],
                   'B': [2,2,2,2,2,2,2,2,2,2],
                   'C': [3,3,3,3,3,3,3,3,3,3],
                   'D': [4,4,4,4,4,4,4,4,4,4],
                   'E': [5,5,5,5,5,5,5,5,5,5]})

You can use the following:

df.apply(pd.Series.nunique) > 1

This gives you:

A    False
B    False
C    False
D    False
E    False
dtype: bool

If we then force a couple of errors:

df.loc[3, 'C'] = 0
df.loc[5, 'B'] = 20

Then you get:

A    False
B     True
C     True
D    False
E    False
dtype: bool
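
Since the question asks for a single yes/no answer, the per-column result above can be collapsed with one more reduction. The following is a minimal sketch, not part of the original answer; the names varies, needs_attention and bad_columns are made up for illustration, and it uses DataFrame.nunique directly instead of apply. Note that nunique ignores NaN by default, so pass dropna=False if missing values should count as a difference.

```python
import pandas as pd

df = pd.DataFrame({'A': [1] * 10,
                   'B': [2] * 10,
                   'C': [3] * 10})
df.loc[3, 'C'] = 0  # force one mismatch

# Per-column flag: True where a column holds more than one distinct value.
varies = df.nunique() > 1

# Single yes/no answer for the whole DataFrame.
needs_attention = bool(varies.any())

# Names of the offending columns, kept for later follow-up.
bad_columns = varies[varies].index.tolist()

print(needs_attention)
print(bad_columns)
```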

Answer 2

You can compare the whole DataFrame to its first row like this:

In [11]: df.eq(df.iloc[0], axis='columns')
Out[11]: 
      A     B     C     D     E
0  True  True  True  True  True
1  True  True  True  True  True
2  True  True  True  True  True
3  True  True  True  True  True
4  True  True  True  True  True
5  True  True  True  True  True
6  True  True  True  True  True
7  True  True  True  True  True
8  True  True  True  True  True
9  True  True  True  True  True

Then test whether all values are True:

In [13]: df.eq(df.iloc[0], axis='columns').all()
Out[13]: 
A    True
B    True
C    True
D    True
E    True
dtype: bool

In [14]: df.eq(df.iloc[0], axis='columns').all().all()
Out[14]: True
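
To get the "does this DataFrame need attention?" flag from the question, the same comparison can be inverted with ne and reduced with any. Below is a rough sketch under a couple of assumptions: has_differences is a hypothetical helper name, and an empty DataFrame is treated as having no differences (iloc[0] would otherwise raise on a frame with no rows). Keep in mind that NaN never compares equal to NaN, so a column containing NaN will be reported as differing.

```python
import pandas as pd

def has_differences(frame):
    """Return True if any column holds a value that differs from its first row."""
    if frame.empty:
        # Assumption: nothing to compare means nothing differs.
        return False
    # ne() marks every cell that is not equal to the first row's value;
    # the two any() calls reduce over rows, then over columns.
    return bool(frame.ne(frame.iloc[0], axis='columns').any().any())

df = pd.DataFrame({'A': [1] * 10, 'B': [2] * 10, 'C': [3] * 10})

clean = has_differences(df)   # all rows identical
df.loc[1, 'C'] = 9            # introduce one mismatch
dirty = has_differences(df)

print(clean, dirty)
```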

Answer 3

You can use apply to loop through the columns and check whether all elements in each column are the same:

df.apply(lambda col: (col != col.iloc[0]).any())

# A    False
# B    False
# C    False
# D    False
# E    False
# dtype: bool
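
If the per-column apply turns out to be slow on wide frames, one possible alternative is to do the comparison on the underlying NumPy array and let broadcasting handle the columns. This is a different technique from the apply call above, shown only as a sketch; the variable names are illustrative, and it assumes the DataFrame has at least one row.

```python
import pandas as pd

df = pd.DataFrame({'A': [1] * 10, 'B': [2] * 10, 'C': [3] * 10})
df.loc[3, 'C'] = 0  # force one mismatch

values = df.to_numpy()

# values[0] is the first row; broadcasting compares it against every row.
# any(axis=0) reduces down the rows, leaving one flag per column.
differs = (values != values[0]).any(axis=0)

# Re-attach the column labels so the result reads like the apply output.
result = pd.Series(differs, index=df.columns)

print(result)
```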
