Question

Suppose I have a df:

        id1   id2   id3  id4  id5   
seq1    hey    go  what   go  key  
seq2   done   six   and  six  six  
...

I need to drop every column that contains a repeated word in at least one row (words from different rows are always different):

        id1   id3  
seq1    hey  what  
seq2   done   and  
...

Here, columns id2 and id4 are dropped because of seq1, and columns id2, id4 and id5 are dropped because of seq2.

Is there an elegant way to do this?

Answer 1

Use boolean indexing with loc to filter the columns:

df = df.loc[:, ~df.apply(lambda x: x.duplicated(keep=False), axis=1).any()]
print(df)
       id1   id3
seq1   hey  what
seq2  done   and
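
To try the one-liner end to end, here is a minimal sketch that rebuilds only the two rows shown in the question (the rows elided by "..." are not reproduced). The names dup_in_row and result are illustrative and not part of the original answer; the result is stored in a new variable so that df itself stays unfiltered.

```python
import pandas as pd

# Rebuild the two sample rows from the question.
df = pd.DataFrame(
    {
        "id1": ["hey", "done"],
        "id2": ["go", "six"],
        "id3": ["what", "and"],
        "id4": ["go", "six"],
        "id5": ["key", "six"],
    },
    index=["seq1", "seq2"],
)

# Row-wise flag: True wherever a value occurs more than once in its own row.
dup_in_row = df.apply(lambda row: row.duplicated(keep=False), axis=1)

# Keep only the columns that are never flagged in any row.
result = df.loc[:, ~dup_in_row.any()]
print(result)
```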

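Explanation:

Call duplicated on each row (the outputs below are for the original df, before it was overwritten with the filtered result):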

print(df.apply(lambda x: x.duplicated(keep=False), axis=1))
        id1   id2    id3   id4    id5
seq1  False  True  False  True  False
seq2  False  True  False  True   True
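
The keep=False argument matters here. As a small illustration on the seq1 row alone (the variable names are hypothetical), the default keep="first" leaves the first occurrence unflagged, which would let id2 survive the filter, while keep=False flags every occurrence:

```python
import pandas as pd

# The seq1 row from the question as a standalone Series.
row = pd.Series(
    ["hey", "go", "what", "go", "key"],
    index=["id1", "id2", "id3", "id4", "id5"],
)

# Default keep="first": only the second "go" (id4) is flagged.
first_kept = row.duplicated()

# keep=False: both occurrences of "go" (id2 and id4) are flagged.
all_marked = row.duplicated(keep=False)

print(first_kept.tolist())  # [False, False, False, True, False]
print(all_marked.tolist())  # [False, True, False, True, False]
```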

Then use DataFrame.any to check whether each column contains at least one True:

print(df.apply(lambda x: x.duplicated(keep=False), axis=1).any())
id1    False
id2     True
id3    False
id4     True
id5     True
dtype: bool

Invert the boolean mask with ~:

print(~df.apply(lambda x: x.duplicated(keep=False), axis=1).any())
id1     True
id2    False
id3     True
id4    False
id5    False
dtype: bool
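
If the filter is needed in more than one place, the three steps could be wrapped in a small helper. This is only a sketch: the function name drop_columns_with_row_duplicates and the toy frame below are made up for illustration. It returns a new frame and leaves its input untouched.

```python
import pandas as pd


def drop_columns_with_row_duplicates(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame without columns whose value repeats within any row."""
    # Step 1: flag, row by row, every value that occurs more than once in that row.
    dup_in_row = frame.apply(lambda row: row.duplicated(keep=False), axis=1)
    # Step 2: a column is dropped if it is flagged in at least one row.
    drop_mask = dup_in_row.any()
    # Step 3: invert the mask and select the surviving columns.
    return frame.loc[:, ~drop_mask]


# Toy data: in row r1 the value 1 appears in both "a" and "c".
toy = pd.DataFrame({"a": [1, 3], "b": [2, 4], "c": [1, 5]}, index=["r1", "r2"])
filtered = drop_columns_with_row_duplicates(toy)
print(filtered)
```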

How to drop columns with duplicated row elements in a pandas DataFrame?
