Question

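I am working on a problem where I have to group related items and assign each group a unique ID. I have written code in Python, but it does not return the expected output. I need help improving my logic. The code is below: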

data = {}
child_list = []


for index, row in df.iterrows():
    parent = row['source']
    child = row['target']
    #print 'Parent: ', parent
    #print 'Child:', child
    child_list.append(child)
    #print child_list
    if parent not in data.keys():
        data[parent] = []
    if parent != child:
        data[parent].append(child)
    #print data

op = {}
gid = 0


def recursive(op,x,gid):
    if x in data.keys() and data[x] != []:
        for x_child in data[x]:
            if x_child in data.keys():
                op[x_child] = gid
                recursive(op,x_child,gid)
            else:
                op[x] = gid
    else:
        op[x] = gid


for key in data.keys():
    #print "Key: ", key
    if key not in child_list:
        gid = gid + 1
        op[key] = gid
        for x in data[key]:
            op[x] = gid
            recursive(op,x,gid)

related = pd.DataFrame({'items':op.keys(),
                  'uniq_group_id': op.values()})
related.sort_values('items')

Here is an example:

Input:
source  target
a        b
b        c
c        c
c        d
d        d
e        f
a        d
h        a
i        f  

Desired Output:
item     uniq_group_id
a         1 
b         1
c         1
d         1
h         1
e         2
f         2
i         2

My code gives the wrong output:

item    uniq_group_id
a       3
b       3
c       3
d       3
e       1
f       2
h       3
i       2

Another example:

Input:
df = pd.DataFrame({'source': ['a','b','c','c','d','e','a','h','i','a'],
                'target':['b','c','c','d','d','f','d','a','f','a']})
Desired Output:
item    uniq_group_id
a       1
b       1
c       1
d       1
e       2
f       2

My code Output:
item    uniq_group_id
e   1
f   1

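The order of the rows or of the group IDs does not matter. What matters is that related items are assigned the same unique identifier. The whole problem is to find groups of related items and give each group a unique group ID.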

Answer 1

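I haven't analyzed your code closely, but the error appears to come from the way you populate the data dictionary. It stores each child node as a neighbor of its parent, but it also needs to store the parent as a neighbor of the child.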
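To make that point concrete, here is a minimal sketch using the edges from the first example in the question. The names `edges`, `directed`, and `undirected` are made up for this illustration and do not appear in the original code.

```python
from collections import defaultdict

# Edges from the first example in the question, as (source, target) pairs.
edges = [('a', 'b'), ('b', 'c'), ('c', 'c'), ('c', 'd'), ('d', 'd'),
         ('e', 'f'), ('a', 'd'), ('h', 'a'), ('i', 'f')]

directed = defaultdict(set)    # parent -> children only, like the question's dict
undirected = defaultdict(set)  # every edge stored in both directions

for source, target in edges:
    directed[source].add(target)
    undirected[source].add(target)
    undirected[target].add(source)

# With one-way storage, 'f' has no recorded link back to 'e' or 'i',
# so a traversal that reaches 'f' cannot discover the rest of its group.
print(sorted(directed['f']))    # []
print(sorted(undirected['f']))  # ['e', 'i']
```

With links in both directions, a traversal started from any member of a group can reach every other member, which is what the grouping needs.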

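Rather than trying to modify your code, I decided to adapt this pseudocode written by Aseem Goyal. The code below takes its input from simple Python lists, but it should be easy to adapt to a Pandas DataFrame.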

''' Find all the connected components of an undirected graph '''

from collections import defaultdict

src = ['a', 'b', 'c', 'c', 'd', 'e', 'a', 'h', 'i', 'a']
tgt = ['b', 'c', 'c', 'd', 'd', 'f', 'd', 'a', 'f', 'a']

nodes = sorted(set(src + tgt))
print('Nodes', nodes)

neighbors = defaultdict(set)
for u, v in zip(src, tgt):
    neighbors[u].add(v)
    neighbors[v].add(u)

print('Neighbors')
for n in nodes:
    print(n, neighbors[n])

visited = {}
def depth_first_traverse(node, group_id):
    for n in neighbors[node]:
        if n not in visited:
            visited[n] = group_id
            depth_first_traverse(n, group_id)

print('Groups')
group_id = 1
for n in nodes:
    if n not in visited:
        visited[n] = group_id
        depth_first_traverse(n, group_id)
        group_id += 1
    print(n, visited[n])

Output

Nodes ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'i']
Neighbors
a {'a', 'd', 'b', 'h'}
b {'a', 'c'}
c {'d', 'b', 'c'}
d {'d', 'a', 'c'}
e {'f'}
f {'i', 'e'}
h {'a'}
i {'f'}
Groups
a 1
b 1
c 1
d 1
e 2
f 2
h 1
i 2

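This code was written for Python 3, but it will also run on Python 2. If you run it on Python 2, you should add from __future__ import print_function at the top of the import statements; it is not strictly necessary, but it makes the output look nicer.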
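As a rough sketch of the Pandas adaptation mentioned above, the same traversal can read its edges from the DataFrame columns and write the result back into a DataFrame. This is one possible variation, not the code from the answer: the helper name `assign_groups` is hypothetical, and it uses an explicit stack instead of recursion, on the assumption that very long chains of items could otherwise hit Python's recursion limit.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({'source': ['a', 'b', 'c', 'c', 'd', 'e', 'a', 'h', 'i', 'a'],
                   'target': ['b', 'c', 'c', 'd', 'd', 'f', 'd', 'a', 'f', 'a']})

# Build the undirected neighbor map straight from the two columns.
neighbors = defaultdict(set)
for u, v in zip(df['source'], df['target']):
    neighbors[u].add(v)
    neighbors[v].add(u)


def assign_groups(neighbors):
    """Label every node with the id of its connected component."""
    visited = {}
    group_id = 0
    for start in sorted(neighbors):
        if start in visited:
            continue
        group_id += 1
        visited[start] = group_id
        stack = [start]  # explicit stack replaces the recursive calls
        while stack:
            node = stack.pop()
            for n in neighbors[node]:
                if n not in visited:
                    visited[n] = group_id
                    stack.append(n)
    return visited


groups = assign_groups(neighbors)
related = pd.DataFrame({'item': list(groups.keys()),
                        'uniq_group_id': list(groups.values())})
related = related.sort_values('item').reset_index(drop=True)
print(related)
```

Iterating over the nodes in sorted order only makes the group numbering reproducible; the question states that the order of the IDs does not matter.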

Answer 2

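You can use the Union-Find, or Disjoint-Sets, algorithm. See this related answer for a more complete explanation. Basically, you need two functions, union and find, to build a tree (i.e. a nested dictionary) of leaders, or predecessors: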

import collections

# maps each item to its leader; None means the item is its own leader
leaders = collections.defaultdict(lambda: None)

def find(x):
    l = leaders[x]
    if l is not None:
        l = find(l)
        leaders[x] = l
        return l
    return x

def union(x, y):
    lx, ly = find(x), find(y)
    if lx != ly:
        leaders[lx] = ly

You can apply this to your problem as follows:

import pandas as pd

df = pd.DataFrame({'source': ['a','b','c','c','d','e','a','h','i','a'],
                   'target': ['b','c','c','d','d','f','d','a','f','a']})

# build the tree
for _, row in df.iterrows():
    union(row["source"], row["target"])

# build groups based on leaders
groups = collections.defaultdict(set)
for x in leaders:
    groups[find(x)].add(x)
for num, group in enumerate(groups.values(), start=1):
    print(num, group)

Result:

1 {'e', 'f', 'i'}
2 {'h', 'a', 'c', 'd', 'b'}
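To get from these groups to the item / uniq_group_id table asked for in the question, each item can be mapped to a number assigned to its leader. The following is a self-contained sketch of that step, not the code from the answer: it uses a plain dict named `parent` (a hypothetical name playing the role of `leaders`) and an iterative `find`, and the names `roots`, `group_ids`, and `result` are likewise assumptions.

```python
import pandas as pd

df = pd.DataFrame({'source': ['a', 'b', 'c', 'c', 'd', 'e', 'a', 'h', 'i', 'a'],
                   'target': ['b', 'c', 'c', 'd', 'd', 'f', 'd', 'a', 'f', 'a']})

parent = {}  # item -> parent item; a root is its own parent


def find(x):
    """Return the root (leader) of x, compressing the path on the way."""
    parent.setdefault(x, x)
    root = x
    while parent[root] != root:
        root = parent[root]
    # Path compression: point every node on the path directly at the root.
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def union(x, y):
    """Merge the groups that contain x and y."""
    rx, ry = find(x), find(y)
    if rx != ry:
        parent[rx] = ry


for s, t in zip(df['source'], df['target']):
    union(s, t)

# Resolve each item to its root, then number the roots 1, 2, ...
# in the order in which they are first seen.
roots = {item: find(item) for item in sorted(parent)}
group_ids = {}
for root in roots.values():
    group_ids.setdefault(root, len(group_ids) + 1)

result = pd.DataFrame({'item': list(roots),
                       'uniq_group_id': [group_ids[r] for r in roots.values()]})
print(result)
```

Which item ends up as the leader of a group depends on the order of the union calls, so the sketch numbers the leaders rather than using them directly as IDs.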

Creating unique IDs for groups
