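I have data like the sample below, and the result I want is that orders b and c are identified as duplicates of each other. How can I do this? (Order and Item form a MultiIndex.)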
Order Item A
a     1    'aaa'
      2    'bb'
b     1    'aaa'
      2    'bb'
      3    'c'
c     1    'aaa'
      2    'bb'
      3    'c'
Answer 0 (score: 1)
This is simple. Convert the groupby result to a plain DataFrame, then use the drop_duplicates method.
df = df.reset_index()
df.drop_duplicates(keep = 'first', inplace = True)
If you need to compare only specific columns:
df.drop_duplicates(subset = [col1, col2, ...], keep = 'first', inplace = True)
Edit
To keep only the duplicated orders:
df = df.groupby('Order')['A'].apply(tuple).reset_index()  # tuples are hashable; duplicated() raises TypeError on lists
df = df[df.duplicated(subset = ['A'], keep = False)]
If you only want the list of orders:
list_orders = df['Order'].unique()
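For reference, here is a self-contained sketch of the steps above on the question's data. The frame is rebuilt with Order and Item as ordinary columns, which is an assumption about what reset_index() yields here, and the names per_order and dups are illustrative only. It collects A into tuples because duplicated() needs hashable values.

```python
import pandas as pd

# The question's data with Order and Item as ordinary columns
# (assumed to match the frame after reset_index()).
df = pd.DataFrame({
    "Order": ["a", "a", "b", "b", "b", "c", "c", "c"],
    "Item": [1, 2, 1, 2, 3, 1, 2, 3],
    "A": ["aaa", "bb", "aaa", "bb", "c", "aaa", "bb", "c"],
})

# One row per order, with its A values collected into a hashable tuple.
per_order = df.groupby("Order")["A"].apply(tuple).reset_index()

# keep=False marks every member of a duplicated set, not only the repeats.
dups = per_order[per_order.duplicated(subset=["A"], keep=False)]

list_orders = dups["Order"].unique().tolist()
print(list_orders)  # ['b', 'c']
```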
Answer 1 (score: 1)
First, group by the first level of the MultiIndex and collect column A into a tuple for each group:
s = df.groupby(level=0)['A'].apply(tuple)
print (s)
Order
a ('aaa', 'bb')
b ('aaa', 'bb', 'c')
c ('aaa', 'bb', 'c')
Name: A, dtype: object
Then use boolean indexing with Series.duplicated to get the index labels of all duplicated values:
out = s.index[s.duplicated(keep=False)]
print (out)
Index(['b', 'c'], dtype='object', name='Order')
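A runnable sketch of the two steps above, with the question's frame rebuilt from scratch. How the MultiIndex is constructed is an assumption, since the question only shows the printed data, and dup_orders is an illustrative name.

```python
import pandas as pd

# Rebuild the question's data with (Order, Item) as a MultiIndex.
idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2),
     ("b", 1), ("b", 2), ("b", 3),
     ("c", 1), ("c", 2), ("c", 3)],
    names=["Order", "Item"],
)
df = pd.DataFrame(
    {"A": ["aaa", "bb", "aaa", "bb", "c", "aaa", "bb", "c"]},
    index=idx,
)

# One tuple of A values per order (first index level).
s = df.groupby(level=0)["A"].apply(tuple)

# Labels of every order whose tuple occurs more than once.
dup_orders = s[s.duplicated(keep=False)].index.tolist()
print(dup_orders)  # ['b', 'c']
```

Note that tuples compare element by element, so two orders count as duplicates only if their items appear in the same sequence. If the row order within an order should not matter, something like apply(lambda x: tuple(sorted(x))) may be a better fit.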
Edit:
import pandas as pd

df = pd.DataFrame(data=[[1, 1, 10, 20], [1, 2, 30, 40],
                        [1, 3, 50, 60], [2, 1, 10, 20],
                        [2, 2, 30, 40], [2, 3, 50, 60],
                        [3, 1, 10, 20], [3, 2, 30, 40],
                        [4, 1, 10, 20], [4, 2, 30, 40]],
                  columns=['id', 'date', 'd1', 'd2'])
print (df)
s = df.groupby('id')[['d1', 'd2']].agg(tuple)  # select several columns with a list; a bare tuple of keys fails in pandas 2.x
print (s)
d1 d2
id
1 (10, 30, 50) (20, 40, 60)
2 (10, 30, 50) (20, 40, 60)
3 (10, 30) (20, 40)
4 (10, 30) (20, 40)
out = s.reset_index().groupby(s.columns.tolist(), sort=False)['id'].apply(tuple).tolist()
print (out)
[(1, 2), (3, 4)]
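The grouping above also returns single-member tuples for ids that match nothing else. As a rough alternative sketch, not the method used in this answer, the ids can be bucketed in a plain dict keyed by each id's rows and the singletons dropped afterwards. The extra id 5 row and the names signatures and dup_groups are hypothetical, added only to show the filtering.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 1, 10, 20], [1, 2, 30, 40], [1, 3, 50, 60],
          [2, 1, 10, 20], [2, 2, 30, 40], [2, 3, 50, 60],
          [3, 1, 10, 20], [3, 2, 30, 40],
          [4, 1, 10, 20], [4, 2, 30, 40],
          [5, 1, 70, 80]],  # hypothetical id with no matching partner
    columns=["id", "date", "d1", "d2"],
)

# Map each distinct sequence of (d1, d2) rows to the ids that share it.
signatures = {}
for id_, group in df.groupby("id"):
    key = tuple(map(tuple, group[["d1", "d2"]].to_numpy().tolist()))
    signatures.setdefault(key, []).append(int(id_))

# Keep only the sets that contain more than one id.
dup_groups = [tuple(ids) for ids in signatures.values() if len(ids) > 1]
print(dup_groups)  # [(1, 2), (3, 4)]
```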