I am working on a problem where I have to group related items and assign each group a unique ID. I have written the code in Python, but it does not return the expected output. I need help improving my logic. The code is below:

    import pandas as pd

    # df is a DataFrame with 'source' and 'target' columns (see the examples below)
    data = {}
    child_list = []

    for index, row in df.iterrows():
        parent = row['source']
        child = row['target']
        #print 'Parent: ', parent
        #print 'Child:', child
        child_list.append(child)
        #print child_list
        if parent not in data.keys():
            data[parent] = []
        if parent != child:
            data[parent].append(child)
        #print data

    op = {}
    gid = 0

    def recursive(op,x,gid):
        if x in data.keys() and data[x] != []:
            for x_child in data[x]:
                if x_child in data.keys():
                    op[x_child] = gid
                    recursive(op,x_child,gid)
                else:
                    op[x] = gid
        else:
            op[x] = gid

    for key in data.keys():
        #print "Key: ", key
        if key not in child_list:
            gid = gid + 1
            op[key] = gid
            for x in data[key]:
                op[x] = gid
                recursive(op,x,gid)

    related = pd.DataFrame({'items': list(op.keys()),
                            'uniq_group_id': list(op.values())})
    related.sort_values('items')

Here is an example.

Input:

    source  target
    a       b
    b       c
    c       c
    c       d
    d       d
    e       f
    a       d
    h       a
    i       f

Desired Output:

    item  uniq_group_id
    a     1
    b     1
    c     1
    d     1
    h     1
    e     2
    f     2
    i     2

My code gives me the wrong output:

    item  uniq_group_id
    a     3
    b     3
    c     3
    d     3
    e     1
    f     2
    h     3
    i     2

Another example.

Input:

    df = pd.DataFrame({'source': ['a','b','c','c','d','e','a','h','i','a'],
                       'target': ['b','c','c','d','d','f','d','a','f','a']})

Desired Output:

    item  uniq_group_id
    a     1
    b     1
    c     1
    d     1
    e     2
    f     2

My code's output:

    item  uniq_group_id
    e     1
    f     1

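The order of the rows and of the group IDs does not matter. What matters is that related items are assigned the same unique identifier. The whole problem is to find the groups of related items and assign each group a unique group ID.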
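Answer 0 (score: 1)

I haven't analyzed your code closely, but it looks like the error comes from the way you populate the `data` dictionary. It stores each child node as a neighbor of its parent, but it also needs to store the parent as a neighbor of the child.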
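
To illustrate the point about neighbors, here is a small sketch using the edges from the question's first example. The helper name `reachable` and the two dictionaries `directed` and `undirected` are made up for this illustration and are not part of the question's code. With parent-to-child storage only, a traversal starting at `a` never finds `h`, even though the row `h a` links them.

```python
from collections import defaultdict

# Edges (source, target) from the question's first example.
edges = [('a', 'b'), ('b', 'c'), ('c', 'c'), ('c', 'd'), ('d', 'd'),
         ('e', 'f'), ('a', 'd'), ('h', 'a'), ('i', 'f')]

# Directed storage: only parent -> child is recorded, as in the question.
directed = defaultdict(set)
# Undirected storage: every edge is recorded in both directions.
undirected = defaultdict(set)
for u, v in edges:
    directed[u].add(v)
    undirected[u].add(v)
    undirected[v].add(u)

def reachable(graph, start):
    """Return the set of nodes that can be reached from start."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        # .get() avoids creating empty entries for nodes without neighbors
        for n in graph.get(node, ()):
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return seen

print(sorted(reachable(directed, 'a')))    # ['a', 'b', 'c', 'd']
print(sorted(reachable(undirected, 'a')))  # ['a', 'b', 'c', 'd', 'h']
```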
Rather than trying to modify your code, I decided to adapt this pseudocode written by Aseem Goyal. The code below takes its input data from simple Python lists, but it should be easy to adapt it to a Pandas DataFrame.

    ''' Find all the connected components of an undirected graph '''

    from collections import defaultdict

    src = ['a', 'b', 'c', 'c', 'd', 'e', 'a', 'h', 'i', 'a']
    tgt = ['b', 'c', 'c', 'd', 'd', 'f', 'd', 'a', 'f', 'a']

    nodes = sorted(set(src + tgt))
    print('Nodes', nodes)

    # Store every edge in both directions
    neighbors = defaultdict(set)
    for u, v in zip(src, tgt):
        neighbors[u].add(v)
        neighbors[v].add(u)

    print('Neighbors')
    for n in nodes:
        print(n, neighbors[n])

    # Maps each visited node to its group ID
    visited = {}
    def depth_first_traverse(node, group_id):
        for n in neighbors[node]:
            if n not in visited:
                visited[n] = group_id
                depth_first_traverse(n, group_id)

    print('Groups')
    group_id = 1
    for n in nodes:
        if n not in visited:
            visited[n] = group_id
            depth_first_traverse(n, group_id)
            group_id += 1
        print(n, visited[n])

Output:

    Nodes ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'i']
    Neighbors
    a {'a', 'd', 'b', 'h'}
    b {'a', 'c'}
    c {'d', 'b', 'c'}
    d {'d', 'a', 'c'}
    e {'f'}
    f {'i', 'e'}
    h {'a'}
    i {'f'}
    Groups
    a 1
    b 1
    c 1
    d 1
    e 2
    f 2
    h 1
    i 2

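This code was written for Python 3, but it will also run on Python 2. If you do run it on Python 2, you should add `from __future__ import print_function` at the top of the import statements; it isn't strictly necessary, but it will make the output look nicer.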
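
A recursive traversal can hit Python's recursion limit when a group contains a long chain of items. As a possible variation, the sketch below does the same connected-components search with an explicit stack and wraps it in a function. The function name `assign_group_ids` and the idea of passing the edges as `(source, target)` pairs are assumptions made for this example, not something from the pseudocode referenced above.

```python
from collections import defaultdict

def assign_group_ids(edges):
    """Map each item to a group ID, one ID per connected component."""
    # Store every edge in both directions
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    group_of = {}
    group_id = 0
    for start in sorted(neighbors):
        if start in group_of:
            continue
        group_id += 1
        group_of[start] = group_id
        stack = [start]  # explicit stack instead of recursion
        while stack:
            node = stack.pop()
            for n in neighbors[node]:
                if n not in group_of:
                    group_of[n] = group_id
                    stack.append(n)
    return group_of

src = ['a', 'b', 'c', 'c', 'd', 'e', 'a', 'h', 'i', 'a']
tgt = ['b', 'c', 'c', 'd', 'd', 'f', 'd', 'a', 'f', 'a']
groups = assign_group_ids(zip(src, tgt))
for item in sorted(groups):
    print(item, groups[item])
```

With a DataFrame `df` like the one in the question, the edges could be passed as `zip(df['source'], df['target'])`, and the result turned back into a table with something like `pd.DataFrame(list(groups.items()), columns=['item', 'uniq_group_id'])`.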
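Answer 1 (score: 1)

You can use the Union-Find, or Disjoint-Sets, algorithm. See this related answer for a more complete explanation. Basically, you need two functions, `union` and `find`, to build a tree of `leaders` or predecessors, stored here as a dictionary that maps each item to its leader:

    import collections

    # Items without an entry have no leader yet, i.e. they are their own root
    leaders = collections.defaultdict(lambda: None)

    def find(x):
        l = leaders[x]
        if l is not None:
            # Follow the chain to the root and remember it (path compression)
            l = find(l)
            leaders[x] = l
            return l
        return x

    def union(x, y):
        lx, ly = find(x), find(y)
        if lx != ly:
            leaders[lx] = ly
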
You can apply this to your problem as follows:

    import pandas as pd

    df = pd.DataFrame({'source': ['a','b','c','c','d','e','a','h','i','a'],
                       'target': ['b','c','c','d','d','f','d','a','f','a']})

    # build the tree
    for _, row in df.iterrows():
        union(row["source"], row["target"])

    # build groups based on leaders
    groups = collections.defaultdict(set)
    for x in leaders:
        groups[find(x)].add(x)

    for num, group in enumerate(groups.values(), start=1):
        print(num, group)

Result:

    1 {'e', 'f', 'i'}
    2 {'h', 'a', 'c', 'd', 'b'}
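
The question asks for one row per item with its group ID rather than one set per group. A minimal sketch of how that mapping might be produced with the same union-find idea follows. It is a variation, not the code above: the function name `group_ids_union_find` is hypothetical, each root points to itself instead of to `None`, and `find` is written iteratively.

```python
def group_ids_union_find(edges):
    """Return {item: group_id} for an iterable of (source, target) pairs."""
    leader = {}

    def find(x):
        leader.setdefault(x, x)  # a new item starts as its own root
        root = x
        while leader[root] != root:
            root = leader[root]
        # Path compression: point every node on the path at the root
        while leader[x] != root:
            leader[x], x = root, leader[x]
        return root

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            leader[rx] = ry

    for u, v in edges:
        union(u, v)

    # Number the roots in the order they are met while walking sorted items
    root_ids = {}
    result = {}
    for item in sorted(leader):
        root = find(item)
        result[item] = root_ids.setdefault(root, len(root_ids) + 1)
    return result

src = ['a', 'b', 'c', 'c', 'd', 'e', 'a', 'h', 'i', 'a']
tgt = ['b', 'c', 'c', 'd', 'd', 'f', 'd', 'a', 'f', 'a']
item_groups = group_ids_union_find(zip(src, tgt))
for item, gid in item_groups.items():
    print(item, gid)
```

Because the items are walked in sorted order, the group that contains the alphabetically first item gets ID 1; the question states that the numbering itself does not matter.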