Question

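I want to create groups by a custom ID and then eliminate groups that are duplicates of one another in certain columns.

For example, from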

| id | A   | B  |
|----|-----|----|
| 1  | foo | 40 |
| 1  | bar | 50 |
| 2  | foo | 40 |
| 2  | bar | 50 |
| 2  | cod | 0  |
| 3  | foo | 40 |
| 3  | bar | 50 |

to

| id | A   | B  |
|----|-----|----|
| 1  | foo | 40 |
| 1  | bar | 50 |
| 2  | foo | 40 |
| 2  | bar | 50 |
| 2  | cod | 0  |

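Here I grouped by id and then removed group 3 because, considering only columns A and B, it is identical to group 1. Group 2 shares some rows with them but is not an exact copy, so it stays.

I have tried looping over the groups, but that is very slow even with only about 12,000 groups. One possible complication is that the groups vary in size.

This is the solution I have been working on, but it takes a very long time and reports no duplicate hits (I know duplicates exist in this database):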

grps = datafinal.groupby('Form_id')
unique_grps = {}

first = True
for lab1, grp1 in grps:
    if first:
        unique_grps[lab1] = grp1
        first=False
        continue
    for lab2, grp2 in unique_grps.copy().items():
        if grp2[['A','B']].equals(grp1[['A','B']]):
            print("hit")
            continue
        unique_grps[lab1] = grp1

Answer 1

Use agg with tuple, then duplicated:

s = df.groupby('id').agg(tuple).sum(1).duplicated()
df.loc[df.id.isin(s[~s].index)]
Out[779]: 
   id    A   B
0   1  foo  40
1   1  bar  50
2   2  foo  40
3   2  bar  50
4   2  cod   0

More info: at this point, everything in each group is collected into a single tuple:

df.groupby('id').agg(tuple).sum(1)
Out[780]: 
id
1            (foo, bar, 40, 50)
2    (foo, bar, cod, 40, 50, 0)
3            (foo, bar, 40, 50)
dtype: object
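The concatenated tuple above lists all A values followed by all B values. As a rough sketch of the same idea that keeps each row's (A, B) pair together, the key below is a tuple of row tuples. The sample frame is rebuilt so the snippet runs on its own, and the names `keys`, `dup`, and `result` are illustrative rather than part of the answer.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2, 3, 3],
    "A": ["foo", "bar", "foo", "bar", "cod", "foo", "bar"],
    "B": [40, 50, 40, 50, 0, 40, 50],
})

# One hashable key per group: the (A, B) rows of the group, in row order.
keys = pd.Series({
    gid: tuple(g[["A", "B"]].itertuples(index=False, name=None))
    for gid, g in df.groupby("id")
})

# duplicated() marks every group whose key has already appeared.
dup = keys.duplicated()

# Keep only rows whose id belongs to a first-seen group.
result = df[df["id"].isin(dup[~dup].index)]
print(result)
```

With this key, two groups count as duplicates only when they contain the same rows in the same order.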

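Update: to ignore the order of the values within a group, sort each tuple first (here with natsort):

from natsort import natsorted
s = df.groupby('id').agg(tuple).sum(1).map(natsorted).map(tuple).duplicated()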
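If natsort is not installed, a similar order-insensitive comparison can be sketched with the built-in `sorted` applied to whole rows instead of individual values. This is an alternative, not the answer's method. It assumes the values within each column are mutually comparable (for example, all strings in A and all numbers in B), and `group_key` is a hypothetical helper name. Group 3 is deliberately listed in reverse row order to show the effect.

```python
import pandas as pd

# Same sample data, but group 3 lists its rows in the opposite order.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2, 3, 3],
    "A": ["foo", "bar", "foo", "bar", "cod", "bar", "foo"],
    "B": [40, 50, 40, 50, 0, 50, 40],
})

def group_key(group, cols=("A", "B")):
    """Hypothetical helper: sorted tuple of row tuples, so row order is ignored."""
    rows = group[list(cols)].itertuples(index=False, name=None)
    return tuple(sorted(rows))

keys = pd.Series({gid: group_key(g) for gid, g in df.groupby("id")})
dup = keys.duplicated()
result = df[df["id"].isin(dup[~dup].index)]
print(result)
```

Sorting rows rather than the flattened values keeps each A value paired with its B value, so groups that merely contain the same values in different pairings are not treated as duplicates.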

Answer 2

You can use the unique_everseen recipe from the itertools documentation (also available in the more_itertools library) together with pd.concat and groupby:

import pandas as pd
from operator import itemgetter
from more_itertools import unique_everseen

def unique_key(x):
    return tuple(map(tuple, x[['A', 'B']].values.tolist()))

def jpp(df):
    groups = map(itemgetter(1), df.groupby('id'))
    return pd.concat(unique_everseen(groups, key=unique_key))

print(jpp(df))

   id    A   B
0   1  foo  40
1   1  bar  50
2   2  foo  40
3   2  bar  50
4   2  cod   0
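For readers who prefer not to add a dependency, the same "keep the first group for each key" logic can be sketched with a plain set of seen keys. The function name `drop_repeated_groups` and its parameters are hypothetical, and the sketch assumes the frame has at least one group, since `pd.concat` raises on an empty list.

```python
import pandas as pd

def drop_repeated_groups(df, by="id", cols=("A", "B")):
    """Keep only the first group for each distinct sequence of `cols` rows."""
    seen = set()
    kept = []
    for _, group in df.groupby(by):
        # Hashable key: the group's rows restricted to `cols`, in row order.
        key = tuple(group[list(cols)].itertuples(index=False, name=None))
        if key not in seen:
            seen.add(key)
            kept.append(group)
    return pd.concat(kept)

df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2, 3, 3],
    "A": ["foo", "bar", "foo", "bar", "cod", "foo", "bar"],
    "B": [40, 50, 40, 50, 0, 40, 50],
})
result = drop_repeated_groups(df)
print(result)
```

Each group is hashed once and looked up in a set, so the work grows roughly linearly with the number of groups instead of comparing every pair. Comparing tuples of values also sidesteps DataFrame.equals, which compares row labels as well as values, so two groups sliced from different rows of the same frame do not compare equal unless their indexes are reset first.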

(Pandas) Removing duplicate groups created by GroupBy
