Question

My df contains duplicate ids:

id    text      text2     text3
1     hello     hello     hello
1     hello     hello     hello
2     hello     hello     goodbye
2     goodbye   hello     goodbye
2     hello     hello     goodbye

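I want to drop every column whose values are the same for an id. That can mean all values in the column are the same (text2), or all values are the same within each id (text3).

Desired result: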

id    text     
1     hello     
1     hello       
2     hello        
2     goodbye        
2     hello

I use this to get the count of unique values in each column:

df.apply(lambda x: len(x.unique()))

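If I drop all columns where this count equals 1, that handles the text2 case. But how should I handle the text3 case? The df has already been grouped by id to find the duplicates; do I need to use groupby again?

As a "bonus", I wouldn't mind knowing how to identify which ids have text that is exactly the same (i.e. in text). I'm basically trying to find which columns cause the duplicates.

Thanks for any insight you can provide!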

Answer 1

Here's one way.

Get the number of unique values in each column

In [1227]: u = df.nunique()

Check, for each column, whether it has a single value within every id group

In [1228]: gu = (df.groupby('id').agg('nunique') == 1).all()

Get the names of the columns that satisfy either condition

u[u == 1].index.union(gu[gu].index).drop('id')

Then use drop

In [1229]: df.drop(u[u == 1].index.union(gu[gu].index).drop('id'), axis=1)
Out[1229]:
   id     text
0   1    hello
1   1    hello
2   2    hello
3   2  goodbye
4   2    hello
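
For readers who want to run this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds the example df from the question and makes two assumptions that are not in the answer above: the value columns are selected explicitly before grouping, and the names value_cols, per_id and to_drop are just illustrative. Recent pandas versions generally leave the grouping column out of groupby results, in which case the trailing .drop('id') above may raise a KeyError; selecting the value columns up front sidesteps that. A column that is constant overall is also constant within every id, so the per-id check alone covers both the text2 and text3 cases (note that nunique ignores NaN by default).

```python
import pandas as pd

# Example frame rebuilt from the question.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "text": ["hello", "hello", "hello", "goodbye", "hello"],
    "text2": ["hello"] * 5,
    "text3": ["hello", "hello", "goodbye", "goodbye", "goodbye"],
})

# Every column except the grouping key (illustrative name).
value_cols = list(df.columns.difference(["id"]))

# Number of distinct values per column within each id.
per_id = df.groupby("id")[value_cols].nunique()

# True for columns that hold exactly one value in every id group.
constant_within_ids = per_id.eq(1).all()

# Columns to drop, then the reduced frame.
to_drop = constant_within_ids[constant_within_ids].index
result = df.drop(columns=to_drop)

print(result)
```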

Details

In [1304]: u
Out[1304]:
id       2
text     2
text2    1
text3    2
dtype: int64

In [1305]: gu
Out[1305]:
id        True
text     False
text2     True
text3     True
dtype: bool

In [1306]: u[u == 1].index.union(gu[gu].index).drop('id')
Out[1306]: Index([u'text2', u'text3'], dtype='object')
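
The "bonus" part of the question is not covered above. One possible reading is that it asks, per id, which columns hold a single value and which do not. Under that assumption, the per-id unique counts already contain the answer; the sketch below is only an illustration, and names such as per_id, ids_with_identical_text and varying_columns are made up for it.

```python
import pandas as pd

# Example frame rebuilt from the question.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "text": ["hello", "hello", "hello", "goodbye", "hello"],
    "text2": ["hello"] * 5,
    "text3": ["hello", "hello", "goodbye", "goodbye", "goodbye"],
})

# Distinct-value counts per column within each id.
per_id = df.groupby("id")[["text", "text2", "text3"]].nunique()

# ids whose rows all share the same value in the text column.
ids_with_identical_text = per_id[per_id["text"].eq(1)].index.tolist()

# ids where the text column takes more than one value.
ids_with_differing_text = per_id[per_id["text"].gt(1)].index.tolist()

# Columns that vary inside at least one id, i.e. the columns that
# keep otherwise duplicated rows from being identical.
varies = per_id.gt(1).any()
varying_columns = varies[varies].index.tolist()

print(per_id)
print(ids_with_identical_text, ids_with_differing_text, varying_columns)
```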

Answer 2

This function has worked for me in most cases:

def remove_single_unique_values(dataframe):
    """
    Drop all the columns that only contain one unique value.
    Not optimized for categorical features yet.
    """
    # Count the distinct values in each column.
    cols_to_drop = dataframe.nunique()
    # Keep only the labels of columns with exactly one distinct value.
    cols_to_drop = cols_to_drop.loc[cols_to_drop.values == 1].index
    dataframe = dataframe.drop(cols_to_drop, axis=1)
    return dataframe
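
As a quick check against the question's data, here is a small usage sketch. The df is rebuilt from the question and the variable name reduced is only illustrative. It suggests that this function handles the text2 case but leaves text3 in place, since text3 has two distinct values overall even though it is constant within each id.

```python
import pandas as pd

def remove_single_unique_values(dataframe):
    """Drop all the columns that only contain one unique value."""
    cols_to_drop = dataframe.nunique()
    cols_to_drop = cols_to_drop.loc[cols_to_drop.values == 1].index
    return dataframe.drop(cols_to_drop, axis=1)

# Example frame rebuilt from the question.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "text": ["hello", "hello", "hello", "goodbye", "hello"],
    "text2": ["hello"] * 5,
    "text3": ["hello", "hello", "goodbye", "goodbye", "goodbye"],
})

reduced = remove_single_unique_values(df)

# text2 is dropped; text3 survives because it varies across ids.
print(list(reduced.columns))
```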

Pandas: drop columns with only one unique value