Given a DataFrame df in the following format:
item attr
1 {1, 2, 3, 4}
2 {2, 4, 3, 2, 10}
3 {4, 37}
4 {1, 2, 3, 4}
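I want to find pairs of items that have the same attr, for example item 1 and item 4. Note that df contains 200,000 items in total, so I am looking for the fastest way to find them. Do you know how to do this? Thanks in advance!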
Answer 0 (score: 1)
You can first convert each set to a tuple, then group item by attr and aggregate with nunique and unique. Finally, filter the result with boolean indexing:
import pandas as pd

df = pd.DataFrame({'item':[1,2,3,4],
                   'attr':[set({1, 2, 3, 4}),set({2, 4, 3, 2, 10}),
                           set({4, 37}), set({1, 2, 3, 4})]})
print (df)
attr item
0 {1, 2, 3, 4} 1
1 {3, 10, 2, 4} 2
2 {4, 37} 3
3 {1, 2, 3, 4} 4
df.attr = df.attr.apply(tuple)
print (df)
attr item
0 (1, 2, 3, 4) 1
1 (3, 10, 2, 4) 2
2 (4, 37) 3
3 (1, 2, 3, 4) 4
df1 = df.item.groupby(df['attr']).agg(['nunique', 'unique'])
df1 = df1[df1['nunique'] == 2]
print (df1)
nunique unique
attr
(1, 2, 3, 4) 2 [1, 4]
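One caveat about the tuple conversion above: tuple(s) follows the set's iteration order, and Python does not guarantee that two equal sets iterate in the same order. The following is a minimal sketch, not part of the original answer, that sorts each set before converting it. It assumes the elements are mutually comparable (as the integers here are), and the column name key is introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'item': [1, 2, 3, 4],
                   'attr': [{1, 2, 3, 4}, {2, 4, 3, 2, 10},
                            {4, 37}, {1, 2, 3, 4}]})

# Sort each set before converting, so equal sets always yield the same tuple.
# 'key' is a hypothetical helper column, not part of the original answer.
df['key'] = df['attr'].apply(lambda s: tuple(sorted(s)))

# Collect the items that share each key.
groups = df.groupby('key')['item'].agg(list)

# Keep keys shared by at least two items (use == 2 for exact pairs only).
shared = groups[groups.map(len) > 1]
result = shared.to_dict()
print(result)
```

Sorting adds a small cost per row, but it makes the grouping key independent of how each set happened to be built.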
Another solution, using duplicated, for the case where each attr value occurs in the DataFrame either once or as a pair:
df = pd.DataFrame({'item':[1,2,3,4],
'attr':[set({1, 2, 3, 4}),set({4, 37}),
set({4, 37}), set({1, 2, 3, 4})]})
print (df)
attr item
0 {1, 2, 3, 4} 1
1 {4, 37} 2
2 {4, 37} 3
3 {1, 2, 3, 4} 4
df.attr = df.attr.apply(tuple)
df1 = df[df.duplicated('attr', keep=False)]
df1 = df1.groupby('attr')['item'].apply(lambda x: x.tolist())
print (df1)
(1, 2, 3, 4) [1, 4]
(4, 37) [2, 3]
Name: item, dtype: object
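The question asks for pairs of items. If an attr value can be shared by more than two items, each group can be expanded into all of its item pairs. The following is a rough sketch under that assumption. The helper name find_pairs and the sorted-tuple key are illustrative choices, not something taken from the answer.

```python
from itertools import combinations

import pandas as pd


def find_pairs(df):
    """Return sorted (item_a, item_b) pairs whose attr sets are equal.

    Hypothetical helper; assumes columns 'item' and 'attr' (sets).
    """
    # Canonical hashable key: sorted tuple of the set's elements.
    keys = df['attr'].apply(lambda s: tuple(sorted(s)))
    # Keep only rows whose key occurs more than once.
    mask = keys.duplicated(keep=False)
    # Collect the items of each duplicated key into a list.
    groups = df.loc[mask, 'item'].groupby(keys[mask]).agg(list)
    pairs = []
    for items in groups:
        # Every 2-combination within a group is a matching pair.
        pairs.extend(combinations(sorted(items), 2))
    return sorted(pairs)


df = pd.DataFrame({'item': [1, 2, 3, 4, 5],
                   'attr': [{1, 2, 3, 4}, {4, 37}, {4, 37},
                            {1, 2, 3, 4}, {4, 37}]})
pairs = find_pairs(df)
print(pairs)
```

Note that a group of n items yields n*(n-1)/2 pairs, so with 200,000 rows it may be preferable to keep the grouped lists and expand pairs only where they are needed.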
Edit in response to a comment:
Use melt to reshape the data:
df = pd.DataFrame({'item':[1,2,3,4,5],
'attr1':[set({1, 2, 3, 4}),set({4, 37}),set({4, 37}),
set({1, 2, 3, 4}), set({4,8})],
'attr2':[set({1, 2 }),set({4, 37}),
set({4, 3}), set({1, 2}), set({4,8})]})
print (df)
attr1 attr2 item
0 {1, 2, 3, 4} {1, 2} 1
1 {4, 37} {4, 37} 2
2 {4, 37} {3, 4} 3
3 {1, 2, 3, 4} {1, 2} 4
4 {8, 4} {8, 4} 5
df = pd.melt(df, id_vars='item', value_name='attr').drop('variable', axis=1)
df.attr = df.attr.apply(tuple)
print (df)
item attr
0 1 (1, 2, 3, 4)
1 2 (4, 37)
2 3 (4, 37)
3 4 (1, 2, 3, 4)
4 5 (8, 4)
5 1 (1, 2)
6 2 (4, 37)
7 3 (3, 4)
8 4 (1, 2)
9 5 (8, 4)
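Once the data is in this long format, it can presumably be grouped in the same way as the single-column case. The following is a sketch of one possible continuation, not the answer's own code. It assumes that an item carrying the same value in both columns should be counted once, and it uses a sorted tuple as the key. The names long_df and shared are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'item': [1, 2, 3, 4, 5],
                   'attr1': [{1, 2, 3, 4}, {4, 37}, {4, 37},
                             {1, 2, 3, 4}, {4, 8}],
                   'attr2': [{1, 2}, {4, 37}, {4, 3}, {1, 2}, {4, 8}]})

# Reshape to long format: one row per (item, attr value).
long_df = pd.melt(df, id_vars='item', value_name='attr').drop(columns='variable')
# Sorted tuples give an order-independent, hashable key.
long_df['attr'] = long_df['attr'].apply(lambda s: tuple(sorted(s)))

# Assumption: an item with the same value in both columns counts once.
long_df = long_df.drop_duplicates(['item', 'attr'])

# Collect the items sharing each value and keep values with 2+ items.
groups = long_df.groupby('attr')['item'].agg(list)
shared = groups[groups.map(len) > 1].to_dict()
print(shared)
```

Here item 5 is not reported, because its value {4, 8} appears in both of its own columns but in no other item.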