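Pandas: finding duplicated data

Time: 2018-09-25 07:37:12

Tags: python pandas duplicates

I have data like this, and the result I want is that orders b and c are duplicates. How can I solve this? (Order and Item form a MultiIndex.)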

Order    Item     A
  a       1       'aaa'
          2       'bb'
  b       1       'aaa'
          2       'bb'
          3       'c'
  c       1       'aaa'
          2       'bb'
          3       'c'

2 answers:

Answer 0 (score: 1)

It's simple. Convert the groupby object to a DataFrame, then use the drop_duplicates method.

df = df.reset_index()
df.drop_duplicates(keep = 'first', inplace = True)

If you need to filter by specific columns:

df.drop_duplicates(subset = [col1, col2, ...], keep = 'first', inplace = True)
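A minimal sketch of these two calls on the question's sample data, assuming the quoted values in the table are plain strings. The subset `['Item', 'A']` is an illustrative choice, not something the answer specifies. Note that drop_duplicates compares individual rows, so on its own it does not say which whole orders are identical.

```python
import pandas as pd

# Sample data modelled on the question: Order and Item form a MultiIndex.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2), ("b", 3), ("c", 1), ("c", 2), ("c", 3)],
    names=["Order", "Item"],
)
df = pd.DataFrame({"A": ["aaa", "bb", "aaa", "bb", "c", "aaa", "bb", "c"]}, index=index)

# Move the index levels into ordinary columns, then drop repeated (Item, A) rows.
flat = df.reset_index()
deduped = flat.drop_duplicates(subset=["Item", "A"], keep="first")
print(deduped)
```

Only the first occurrence of each (Item, A) pair survives, so the rows that remain come from orders a and b.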

Edit

To keep the duplicates instead:

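# Collect each order's values into a tuple: lists are unhashable, so duplicated() would raise a TypeError on them
df = df.groupby('Order')['A'].apply(tuple).reset_index()
df = df[df.duplicated(subset = ['A'], keep = False)]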

If you only want the list of orders:

list_orders = df['Order'].unique()
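A self-contained sketch of the steps above on the question's sample data, assuming the values in A are plain strings. It collects each order's values into a tuple, since duplicated() needs hashable values, and it groups on the index level directly rather than after reset_index.

```python
import pandas as pd

# Sample data modelled on the question: Order and Item form a MultiIndex.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2), ("b", 3), ("c", 1), ("c", 2), ("c", 3)],
    names=["Order", "Item"],
)
df = pd.DataFrame({"A": ["aaa", "bb", "aaa", "bb", "c", "aaa", "bb", "c"]}, index=index)

# One row per order, with all of its A values gathered into a hashable tuple.
per_order = df.groupby(level="Order")["A"].apply(tuple).reset_index()

# keep=False marks every order whose tuple appears more than once.
dupes = per_order[per_order.duplicated(subset=["A"], keep=False)]

list_orders = dupes["Order"].unique()
print(list(list_orders))  # ['b', 'c']
```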

Answer 1 (score: 1)

First, create tuples of column A, grouped by the first level of the MultiIndex:

s = df.groupby(level=0)['A'].apply(tuple)
print (s)
Order
a         ('aaa', 'bb')
b    ('aaa', 'bb', 'c')
c    ('aaa', 'bb', 'c')
Name: A, dtype: object

Then use boolean indexing with Series.duplicated to get the index of all duplicated values:

out = s.index[s.duplicated(keep=False)]
print (out)
Index(['b', 'c'], dtype='object', name='Order')
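The role of keep=False may be easier to see next to the default behaviour. The sketch below rebuilds the Series printed above by hand (an assumption made to keep the example short) and compares the two masks.

```python
import pandas as pd

# Tuples per order, built directly to match the Series printed above.
s = pd.Series(
    [("aaa", "bb"), ("aaa", "bb", "c"), ("aaa", "bb", "c")],
    index=pd.Index(["a", "b", "c"], name="Order"),
    name="A",
)

mask_all = s.duplicated(keep=False)       # every member of a repeated group
mask_first = s.duplicated(keep="first")   # only the later repeats

print(s.index[mask_all].tolist())    # ['b', 'c']
print(s.index[mask_first].tolist())  # ['c']
```

With keep='first', order b would be treated as the original and left out, which is why the answer passes keep=False.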

Edit: with several value columns, the same idea groups the ids whose d1 and d2 tuples match:

import pandas as pd

df = pd.DataFrame(data=[[1, 1, 10, 20], [1, 2, 30, 40],
                        [1, 3, 50, 60], [2, 1, 10, 20],
                        [2, 2, 30, 40], [2, 3, 50, 60],
                        [3, 1, 10, 20], [3, 2, 30, 40],
                        [4, 1, 10, 20], [4, 2, 30, 40]], columns=['id', 'date', 'd1', 'd2'])
print (df)

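# Select the columns with a list; recent pandas versions no longer accept a bare tuple of names here
s = df.groupby('id')[['d1', 'd2']].agg(tuple)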
print (s)
              d1            d2
id                            
1   (10, 30, 50)  (20, 40, 60)
2   (10, 30, 50)  (20, 40, 60)
3       (10, 30)      (20, 40)
4       (10, 30)      (20, 40)

out = s.reset_index().groupby(s.columns.tolist(), sort=False)['id'].apply(tuple).tolist()
print (out)
[(1, 2), (3, 4)]
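If some ids match nothing else, the one-liner above returns them as one-element tuples. The sketch below is one possible way to keep only genuine duplicate groups. The helper name `duplicate_groups` is hypothetical, it swaps the second groupby for a plain dictionary, and id 5 is an extra row added here for illustration.

```python
from collections import defaultdict

import pandas as pd


def duplicate_groups(df, key, value_cols):
    """Hypothetical helper: return tuples of `key` values whose rows in
    `value_cols` are identical (same values, in the same row order)."""
    members = defaultdict(list)
    for key_value, frame in df.groupby(key, sort=False):
        # Turn the group's rows into a hashable signature.
        signature = tuple(map(tuple, frame[value_cols].to_numpy().tolist()))
        members[signature].append(key_value)
    # Keep only signatures shared by more than one key value.
    return [tuple(ids) for ids in members.values() if len(ids) > 1]


df = pd.DataFrame(
    data=[[1, 1, 10, 20], [1, 2, 30, 40], [1, 3, 50, 60],
          [2, 1, 10, 20], [2, 2, 30, 40], [2, 3, 50, 60],
          [3, 1, 10, 20], [3, 2, 30, 40],
          [4, 1, 10, 20], [4, 2, 30, 40],
          [5, 1, 99, 99]],  # unmatched id, added for illustration only
    columns=["id", "date", "d1", "d2"],
)

groups = duplicate_groups(df, "id", ["d1", "d2"])
print(groups)
```

Here ids 1 and 2 form one group, ids 3 and 4 another, and id 5 is dropped because nothing matches it. The comparison is sensitive to row order within each id, so sort first if the order should not matter.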