Question

I can solve my task by writing a for loop, but I wonder how to do this in a better way.

I have this dataframe that stores some lists, and I want to find all rows whose lists have any values in common.

(This code is only there to obtain a df with lists:

>>> df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,1,4,6]})
>>> df
   a  b
0  A  1
1  A  2
2  B  5
3  B  1
4  B  4
5  C  6
>>> d = df.groupby('a')['b'].apply(list)

)

Here we go:

>>> d

A       [1, 2]
B    [5, 1, 4]
C          [6]
Name: b, dtype: object

I want to select the rows with index 'A' and 'B', because their lists overlap on the value 1.

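I could now write a for loop, or expand the dataframe at these lists (reversing the way I built it above) and end up with multiple rows that duplicate the other values. What would you do here? Or is there some way to compare two rows, using something like df.groupby(by=lambda x, y: not set(x).isdisjoint(y))? But groupby and boolean masking only look at one element at a time...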

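I have now tried overloading the equality operator, first for lists and, because lists are not hashable, then for tuples and sets (I set the hash to 1 to avoid identity comparison). I then used groupby and merged the frame with itself, but it looks as if it checks off the indices it has already matched.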

import pandas as pd
import numpy as np
from operator import itemgetter


class IndexTuple(set):
    def __hash__(self):
        #print(hash(str(self)))
        return hash(1)
    def __eq__(self, other):

        #print("eq ")
        is_equal = not set(self).isdisjoint(other)

        return is_equal

l = IndexTuple((1, 7))
l1 = IndexTuple((4, 7))

print(l == l1)

df = pd.DataFrame(np.random.randint(low=0, high=4, size=(10, 2)), columns=['a','b']).reset_index()
d = df.groupby('a')['b'].apply(IndexTuple).to_frame().reset_index()

print(d)

print(d.groupby('b').b.apply(list))

print(d.merge(d, on='b', how='outer'))

Output (it works for the first element, but for [{3}] it should give [{3}, {0, 3}] instead):

True
   a       b
0  0     {1}
1  1  {0, 2}
2  2     {3}
3  3  {0, 3}

b
{1}                  [{1}]
{0, 2}    [{0, 2}, {0, 3}]
{3}                  [{3}]

Name: b, dtype: object
   a_x       b  a_y
0    0     {1}    0
1    1  {0, 2}    1
2    1  {0, 2}    3
3    3  {0, 3}    1
4    3  {0, 3}    3
5    2     {3}    2

Answer 1

Use merge on df:

v = df.merge(df, on='b')
common_cols = set(
    np.sort(v.iloc[:, [0, -1]].query('a_x != a_y'), axis=1).ravel()
)

common_cols
{'A', 'B'}

Now, pre-filter and call groupby:

df[df.a.isin(common_cols)].groupby('a').b.apply(list)
a
A       [1, 2]
B    [5, 1, 4]
Name: b, dtype: object
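
If only the Series of lists d is at hand rather than the original df, the same self-merge idea could probably be applied after flattening the lists first. The following is a minimal sketch, assuming a pandas version that provides Series.explode (0.25 or later); the names flat, pairs, common and result are only illustrative and not part of the answer above.

```python
import pandas as pd

# Series of lists, shaped like `d` in the question.
d = pd.Series({'A': [1, 2], 'B': [5, 1, 4], 'C': [6]}, name='b')
d.index.name = 'a'

# Flatten to one row per (label, value) pair.
flat = d.explode().reset_index()

# Self-merge on the value; rows whose two labels differ mark an overlap.
pairs = flat.merge(flat, on='b').query('a_x != a_y')
common = set(pairs['a_x'])

# Keep only the labels that share at least one value with another label.
result = d[d.index.isin(common)]
print(result)
```

Because every overlapping pair shows up in both orders after the self-merge, collecting a_x alone is enough here.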

Answer 2

I know you asked for a "pandorable" solution, but in my opinion this task is not ideally suited to pandas.

Below is a solution using collections.defaultdict and itertools.combinations, which gives the result without using a dataframe.

from collections import defaultdict
from itertools import combinations

data = {'a':['A','A','B','B','B','C'], 'b':[1,2,5,1,4,6]}

d = defaultdict(set)

for i, j in zip(data['a'], data['b']):
    d[i].add(j)

res = {frozenset({i, j}) for i, j in combinations(d, 2) if not d[i].isdisjoint(d[j])}

# {frozenset({'A', 'B'})}

Explanation

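Group the values into a collections.defaultdict of sets. This step has O(n) complexity.
Iterate over pairs of keys with itertools.combinations, using a set comprehension, to find the sets that are not disjoint.
Use frozenset (or a sorted tuple) as the key type, because lists are mutable and therefore cannot be used as dictionary keys.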
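
The result above is a set of overlapping pairs, while the question asks for the rows themselves. One way to get from one to the other, as a rough sketch: take the union of all pairs and look the labels up again. The names groups, overlapping_pairs, labels and selected are made up for this illustration.

```python
from collections import defaultdict
from itertools import combinations

data = {'a': ['A', 'A', 'B', 'B', 'B', 'C'], 'b': [1, 2, 5, 1, 4, 6]}

# Group the values by label into sets.
groups = defaultdict(set)
for label, value in zip(data['a'], data['b']):
    groups[label].add(value)

# Pairs of labels whose value sets share at least one element.
overlapping_pairs = {
    frozenset({i, j})
    for i, j in combinations(groups, 2)
    if not groups[i].isdisjoint(groups[j])
}

# Flatten the pairs into the set of labels involved in any overlap.
labels = set().union(*overlapping_pairs)

# Look the values up again for those labels only.
selected = {label: sorted(groups[label]) for label in sorted(labels)}
print(selected)
```

Note that the pairwise comparison is quadratic in the number of labels, so this suits a modest number of groups.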

Find rows in a pandas dataframe where different rows have common values in the lists of a column that stores lists
