Fast way to find duplicated cells in a DataFrame column in python-pandas?

Time: 2016-10-06 06:05:18

Tags: python pandas python-3.5

Given a DataFrame df in the following format:

item        attr
1           {1, 2, 3, 4}
2           {2, 4, 3, 2, 10}
3           {4, 37}
4           {1, 2, 3, 4}

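I want to find the pairs of items that have the same attr, for example item 1 and item 4 in the sample above. Note that df contains 200,000 items in total, so I am looking for the fastest way to find them. Do you know how to do this? Thanks in advance!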

1 answer:

Answer 0 (score: 1)

You can first convert each set to a tuple, then aggregate with nunique and unique, and finally filter the result with boolean indexing:

import pandas as pd

df = pd.DataFrame({'item':[1,2,3,4],
                  'attr':[set({1, 2, 3, 4}),set({2, 4, 3, 2, 10}),
                          set({4, 37}), set({1, 2, 3, 4})]})

print(df)
            attr  item
0   {1, 2, 3, 4}     1
1  {3, 10, 2, 4}     2
2        {4, 37}     3
3   {1, 2, 3, 4}     4

# sets are not hashable, so convert each one to a tuple before grouping
df.attr = df.attr.apply(tuple)
print(df)
            attr  item
0   (1, 2, 3, 4)     1
1  (3, 10, 2, 4)     2
2        (4, 37)     3
3   (1, 2, 3, 4)     4

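# for each attr value, count the distinct items and list them
df1 = df.item.groupby(df['attr']).agg(['nunique', 'unique'])
# boolean indexing: keep the attr values shared by exactly two items
df1 = df1[df1['nunique'] == 2]
print(df1)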
              nunique  unique
attr                         
(1, 2, 3, 4)        2  [1, 4]
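
One caveat with the tuple conversion above: two equal sets are not guaranteed to iterate in the same order, so `tuple(s)` may produce different keys for the same contents. The following is a minimal plain-Python sketch of an order-independent alternative that buckets items by `frozenset`. It is not the answer's method, and the helper name `group_items_by_attr` is made up for illustration.

```python
from collections import defaultdict


def group_items_by_attr(items, attrs):
    """Return {attr: [items]} for every attr carried by more than one item.

    Hypothetical helper: `items` and `attrs` are parallel iterables,
    for example the two columns of the DataFrame.
    """
    buckets = defaultdict(list)
    for item, attr in zip(items, attrs):
        # frozenset is hashable and compares by content, not by element order
        buckets[frozenset(attr)].append(item)
    # keep only the attr values that occur for at least two items
    return {key: members for key, members in buckets.items() if len(members) > 1}


items = [1, 2, 3, 4]
attrs = [{1, 2, 3, 4}, {2, 4, 3, 2, 10}, {4, 37}, {4, 3, 2, 1}]
shared = group_items_by_attr(items, attrs)
print(shared)
```

With the DataFrame from the question, the call would be `group_items_by_attr(df['item'], df['attr'])`. This is a single pass over the rows, which may be worth timing against the groupby version on the full 200,000 items.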

Another solution uses duplicated, for the case where each attr value in the DataFrame occurs either once or as a pair:

df = pd.DataFrame({'item':[1,2,3,4],
                  'attr':[set({1, 2, 3, 4}),set({4, 37}),
                          set({4, 37}), set({1, 2, 3, 4})]})

print(df)
           attr  item
0  {1, 2, 3, 4}     1
1       {4, 37}     2
2       {4, 37}     3
3  {1, 2, 3, 4}     4

df.attr = df.attr.apply(tuple)

# keep every row whose attr occurs more than once, then list the items per attr
df1 = df[df.duplicated('attr', keep=False)]
df1 = df1.groupby('attr')['item'].apply(lambda x: x.tolist())
print(df1)
(1, 2, 3, 4)    [1, 4]
(4, 37)         [2, 3]
Name: item, dtype: object
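
The question asks for pairs of items, while the result above lists all items per attr value. As a rough sketch, the lists can be expanded into pairs with `itertools.combinations`. The helper name `item_pairs` is hypothetical, and the input is assumed to be a mapping from attr key to item list.

```python
from itertools import combinations


def item_pairs(groups):
    """Yield every unordered pair of items that share an attr value.

    `groups` maps an attr key to the list of items that carry it
    (hypothetical input shape, mirroring the grouped result above).
    """
    for items in groups.values():
        # sorted(set(...)) removes repeated items so no item is paired with itself
        for pair in combinations(sorted(set(items)), 2):
            yield pair


groups = {(1, 2, 3, 4): [1, 4], (4, 37): [2, 3]}
pairs = list(item_pairs(groups))
print(pairs)
```

For the grouped Series `df1` above, the call would be `item_pairs(df1.to_dict())`. Note that a group of k items expands to k(k-1)/2 pairs, so very common attr values can dominate the output.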

Edit based on the comments:

Reshape with melt:

df = pd.DataFrame({'item':[1,2,3,4,5],
                  'attr1':[set({1, 2, 3, 4}),set({4, 37}),set({4, 37}), 
                           set({1, 2, 3, 4}), set({4,8})],
                  'attr2':[set({1, 2 }),set({4, 37}),
                           set({4, 3}), set({1, 2}), set({4,8})]})

print(df)
          attr1    attr2  item
0  {1, 2, 3, 4}   {1, 2}     1
1       {4, 37}  {4, 37}     2
2       {4, 37}   {3, 4}     3
3  {1, 2, 3, 4}   {1, 2}     4
4        {8, 4}   {8, 4}     5

# stack attr1 and attr2 into a single attr column, one row per (item, attr)
df = pd.melt(df, id_vars='item', value_name='attr').drop('variable', axis=1)
df.attr = df.attr.apply(tuple)
print(df)
   item          attr
0     1  (1, 2, 3, 4)
1     2       (4, 37)
2     3       (4, 37)
3     4  (1, 2, 3, 4)
4     5        (8, 4)
5     1        (1, 2)
6     2       (4, 37)
7     3        (3, 4)
8     4        (1, 2)
9     5        (8, 4)
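
The reshaped frame can then go through the same duplicate search as before. The sketch below is one possible continuation, not part of the original answer. It assumes a sorted tuple as the grouping key (the column name `key` is made up) and drops repeated (item, key) rows first, because after melt an item such as 2 or 5 carries the same value in both attr columns.

```python
import pandas as pd

df = pd.DataFrame({
    'item': [1, 2, 3, 4, 5],
    'attr1': [{1, 2, 3, 4}, {4, 37}, {4, 37}, {1, 2, 3, 4}, {4, 8}],
    'attr2': [{1, 2}, {4, 37}, {4, 3}, {1, 2}, {4, 8}],
})

# one row per (item, attr) combination
long_df = pd.melt(df, id_vars='item', value_name='attr').drop('variable', axis=1)

# assumption: a sorted tuple is used as an order-independent, hashable key
long_df['key'] = long_df['attr'].apply(lambda s: tuple(sorted(s)))

# an item that repeats the same attr in both columns should count only once
long_df = long_df.drop_duplicates(['item', 'key'])

# list the items per key and keep the keys shared by more than one item
shared = long_df.groupby('key', sort=False)['item'].apply(list)
shared = shared[shared.map(len) > 1]
print(shared)
```

Here item 5 drops out of the result, since its value (4, 8) only matches itself across the two columns.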