Question

I have a dataset with the columns

Date, Name, Type, ....

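I'm trying to find a way to grab every row that is a duplicate when all three columns are combined as the index, but it doesn't seem to work. I tried setting the index and then selecting the duplicates, but the result doesn't look right.

This is what I did: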

pc = pc.set_index(['name', 'date', 'type']).sort_index()
pc[pc.index.duplicated()]

But this seems to return more than I expected.

Answer 1

Use the parameter keep=False:

import pandas as pd

pc = pd.DataFrame([[0, 1, 2, 3, 4],
                   [0, 1, 2, 4, 5],
                   [0, 2, 3, 5, 6]],
                  columns=['name', 'date', 'type', 'val', 'val2'])

pc = pc.set_index(['name', 'date', 'type']).sort_index()

res = pc[pc.index.duplicated(keep=False)]

#                 val  val2
# name date type           
# 0    1    2       3     4
#           2       4     5

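According to the documentation:

keep : {'first', 'last', False}, default 'first'

• 'first': mark duplicates as True except for the first occurrence.
• 'last': mark duplicates as True except for the last occurrence.
• False: mark all duplicates as True.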
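To make the three options concrete, here is a minimal sketch that applies each one to the same index. The sample data is the toy frame from this answer, and the variable names mask_first, mask_last and mask_all are only illustrative.

```python
import pandas as pd

# Toy data from the answer above: the first two rows share the same
# (name, date, type) combination, the third row is unique.
pc = pd.DataFrame(
    [[0, 1, 2, 3, 4],
     [0, 1, 2, 4, 5],
     [0, 2, 3, 5, 6]],
    columns=['name', 'date', 'type', 'val', 'val2'],
).set_index(['name', 'date', 'type']).sort_index()

# Each call returns a boolean array aligned with the index.
mask_first = pc.index.duplicated(keep='first')  # flags only the later repeat
mask_last = pc.index.duplicated(keep='last')    # flags only the earlier repeat
mask_all = pc.index.duplicated(keep=False)      # flags every row in a duplicate group

print(mask_first.tolist())  # [False, True, False]
print(mask_last.tolist())   # [True, False, False]
print(mask_all.tolist())    # [True, True, False]
```

With the default keep='first', one row of each duplicate group is left out of the selection, which is probably why the original attempt did not return the expected rows.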

Answer 2

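From the documentation, the subset parameter of DataFrame.duplicated should do what you want. Apply it while name, date and type are still ordinary columns, that is, before calling set_index:

# Boolean mask: True for every repeat after the first occurrence
pc.duplicated(subset=['name', 'date', 'type'])

# Keep only the repeated rows (the first occurrence of each group is dropped)
pc[pc.duplicated(subset=['name', 'date', 'type'])]

# Keep only the first occurrence of each combination (the `first` strategy)
pc[~pc.duplicated(subset=['name', 'date', 'type'])]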
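If every member of a duplicate group is wanted, subset can be combined with keep=False. The following is a small sketch on the same toy data as Answer 1. The names key_cols, all_dupes and uniques_only are made up for illustration.

```python
import pandas as pd

# Toy data: rows 0 and 1 share the same (name, date, type) combination.
pc = pd.DataFrame(
    [[0, 1, 2, 3, 4],
     [0, 1, 2, 4, 5],
     [0, 2, 3, 5, 6]],
    columns=['name', 'date', 'type', 'val', 'val2'],
)

key_cols = ['name', 'date', 'type']  # columns that define a duplicate

# keep=False marks every row whose key combination occurs more than once.
all_dupes = pc[pc.duplicated(subset=key_cols, keep=False)]

# Negating the same mask leaves only rows whose combination occurs once.
uniques_only = pc[~pc.duplicated(subset=key_cols, keep=False)]

print(all_dupes)
print(uniques_only)
```

This works on the columns directly, so no set_index call is needed and the original row labels are preserved.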

Duplicates based on three columns
