Question

I have the following DataFrame:

| col1 | col2 | col3 | col4 |
|------|------|------|------|
| a    | 1    | 2    | abc  |
| b    | 1    | 2    | abc  |
| c    | 3    | 2    | def  |

I want the rows that are duplicated on col2, col3, and col4 but have unique values in col1.

In this case, the output would be:

| col1 | col2 | col3 | col4 |
|------|------|------|------|
| a    | 1    | 2    | abc  |
| b    | 1    | 2    | abc  |

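Running df.duplicated with col1 excluded does not work for me, because I need the col1 information in the result. I have millions of rows, and without that information further analysis would be difficult. I cannot set col1 as the index, because some other values need to be the index.

Is there a pythonic / pandaic way to achieve this?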

Answer 1

We can use groupby:

df[df.groupby(['col2','col3','col4']).col1.transform(len) > 1]
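A self-contained sketch of this approach on the question's sample data. The names `group_size` and `result` are illustrative only, and `transform('size')` is swapped in for `transform(len)`; both give the number of rows in each group.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c'],
                   'col2': [1, 1, 3],
                   'col3': [2, 2, 2],
                   'col4': ['abc', 'abc', 'def']})

# Row count of each (col2, col3, col4) group, broadcast back onto every row
group_size = df.groupby(['col2', 'col3', 'col4'])['col1'].transform('size')

# Keep only the rows whose group has more than one member
result = df[group_size > 1]
print(result)
```

Because the mask is aligned with the original index, col1 stays an ordinary column and whatever index the frame already has is preserved.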

Answer 2

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c'],
                   'col2': [1, 1, 3],
                   'col3': [2, 2, 2],
                   'col4': ['abc', 'abc', 'def']})

# keep=False flags every row of a duplicate group, not only the repeats
df[df.duplicated(subset=['col2', 'col3', 'col4'], keep=False)]

Output:

  col1  col2  col3 col4
0    a     1     2  abc
1    b     1     2  abc

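df.duplicated finds duplicate rows in the DataFrame. subset restricts the comparison to the listed columns, and keep=False flags both rows. If you only want to see one row of each duplicate pair, drop the keep argument: the default, keep='first', flags every occurrence except the first.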
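To make the effect of keep concrete, here is a small sketch comparing the three settings on the same sample data. The variable names (`cols`, `mask_all`, `mask_first`, `mask_last`) are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c'],
                   'col2': [1, 1, 3],
                   'col3': [2, 2, 2],
                   'col4': ['abc', 'abc', 'def']})
cols = ['col2', 'col3', 'col4']

# keep=False: every member of a duplicate group is flagged
mask_all = df.duplicated(subset=cols, keep=False)

# keep='first' (the default): the first occurrence is NOT flagged
mask_first = df.duplicated(subset=cols)

# keep='last': the last occurrence is NOT flagged
mask_last = df.duplicated(subset=cols, keep='last')

print(df[mask_all])    # rows a and b
print(df[mask_first])  # row b only
print(df[mask_last])   # row a only
```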

Answer 3

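We can also use filter:

df.groupby(['col2','col3','col4']).filter(lambda x : (x['col1'].nunique()==x['col1'].count())&(x['col1'].nunique()>1))
Out[65]:
  col1  col2  col3 col4
0    a     1     2  abc
1    b     1     2  abc

The first condition ensures that col1 has no repeated values within the group; the second ensures that the group contains more than one row.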
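The filter version differs from the plain duplicated check only when col1 itself repeats inside a group. The sketch below assumes two extra, made-up rows (col1 = 'd' twice) that are not in the question's data, purely to show that difference; `all_unique_and_many`, `kept`, and `by_duplicated` are hypothetical names.

```python
import pandas as pd

# Sample data from the question plus two hypothetical rows where col1 repeats
df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd', 'd'],
                   'col2': [1, 1, 3, 5, 5],
                   'col3': [2, 2, 2, 6, 6],
                   'col4': ['abc', 'abc', 'def', 'ghi', 'ghi']})
cols = ['col2', 'col3', 'col4']

def all_unique_and_many(group):
    """True when col1 has no repeats in the group and the group has > 1 row."""
    n_unique = group['col1'].nunique()
    return bool(n_unique == group['col1'].count() and n_unique > 1)

# Groups failing the predicate are dropped as a whole
kept = df.groupby(cols).filter(all_unique_and_many)

# For comparison: duplicated ignores col1 entirely
by_duplicated = df[df.duplicated(subset=cols, keep=False)]

print(kept)
print(by_duplicated)
```

Here `kept` contains only rows a and b, while `by_duplicated` also includes the two d rows. Note that filter calls a Python function once per group, so on millions of rows with many groups it may be noticeably slower than the boolean-mask approaches.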
