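I have a dataset with two columns in a CSV file. The purpose of the dataset is to provide a link between two different IDs when they belong to the same person. For example, 2, 3 and 5 belong to 1:
COLA COLB
1 2
1 3
1 5
2 6
3 7
9 10
In the example above, 1 is linked to 2, 3 and 5, 2 is linked to 6, and 3 is linked to 7. What I want to achieve is to identify all records that are linked to 1 either directly (2, 3, 5) or indirectly (6, 7), so that I can say these IDs in column B belong to the same person as the ID in column A. I then want to either deduplicate or add a new column to the output file that is populated with 1 for every row linked to 1.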
Example of the expected output:
colA colB GroupField
1 2 1
1 3 1
1 5 1
2 6 1
3 7 1
9 10 9
10 11 9
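How can I solve this?
So far I have been able to read the file in and build a dictionary. I have looked into using Python set operations, but I could not get them to work with the dictionary.
I have also looked into converting the dictionary to sets and then using set operators to deduplicate between the sets, but I could not find anything online, and I am not sure whether this is the right way to approach the problem.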
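Answer 0 (score: 0)
Your input is a graph, and many Python libraries can help you analyze one. NetworkX is one of them.
You are looking for the connected components of the graph, and a number of functions in NetworkX can find them.
A few lines of code to get you started: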
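import networkx as nx
file_contents = "1 2 ; 1 3 ; 1 5 ; 2 6 ; 3 7 ; 9 10"
# Split the sample into one "a b" edge string per link.
lines = [item.strip() for item in file_contents.split(";")]
G = nx.parse_edgelist(lines, nodetype=int)
# connected_components returns a generator of sets, so materialize it.
components = list(nx.connected_components(G))
# components now holds:
# [{1, 2, 3, 5, 6, 7}, {9, 10}]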
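To get from the components found above to the GroupField column the question asks for, one option is to label each component with its smallest ID and look that label up for every row. This is only a minimal sketch: `label_rows` is a hypothetical helper name, the components are hard-coded so the snippet runs on its own, and using the smallest ID as the label is an assumption based on the expected output in the question.

```python
def label_rows(edges, components):
    """Return (colA, colB, group) rows; group is the smallest ID in the row's component."""
    group_of = {}
    for component in components:
        label = min(component)  # assumption: the smallest ID names the group
        for node in component:
            group_of[node] = label
    # Both ends of an edge share a component, so looking up colA is enough.
    return [(a, b, group_of[a]) for a, b in edges]


# Same links as the sample input, with the components hard-coded for illustration.
edges = [(1, 2), (1, 3), (1, 5), (2, 6), (3, 7), (9, 10)]
components = [{1, 2, 3, 5, 6, 7}, {9, 10}]

rows = label_rows(edges, components)
for a, b, group in rows:
    print(a, b, group)
```

Each resulting tuple can then be written out as one line of the output file.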
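Answer 1 (score: 0)
Logc, thank you for pointing me in the right direction. I was able to solve this using a bit of NetworkX together with Tarjan's algorithm for strongly connected components. Here is my code: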
import networkx as nx


def strongly_connected_components(graph):
    """ Find the strongly connected components in a graph using
    Tarjan's algorithm.
    graph should be a dictionary mapping node names to
    lists of successor nodes.
    """
    result = []
    stack = []
    low = {}

    def visit(node):
        if node in low:
            return
        num = len(low)
        low[node] = num
        stack_pos = len(stack)
        stack.append(node)
        for successor in graph[node]:
            visit(successor)
            low[node] = min(low[node], low[successor])
        if num == low[node]:
            # node is the root of a component: pop it off the stack.
            component = tuple(stack[stack_pos:])
            del stack[stack_pos:]
            result.append(component)
            for item in component:
                low[item] = len(graph)

    for node in graph:
        visit(node)
    return result


def topological_sort(graph):
    # Count incoming edges for every node.
    count = {}
    for node in graph:
        count[node] = 0
    for node in graph:
        for successor in graph[node]:
            count[successor] += 1
    ready = [node for node in graph if count[node] == 0]
    result = []
    while ready:
        node = ready.pop(-1)
        result.append(node)
        for successor in graph[node]:
            count[successor] -= 1
            if count[successor] == 0:
                ready.append(successor)
    return result


def robust_topological_sort(graph):
    """ First identify strongly connected components,
    then perform a topological sort on these components. """
    components = strongly_connected_components(graph)
    node_component = {}
    for component in components:
        for node in component:
            node_component[node] = component
    component_graph = {}
    for component in components:
        component_graph[component] = []
    for node in graph:
        node_c = node_component[node]
        for successor in graph[node]:
            successor_c = node_component[successor]
            if node_c != successor_c:
                component_graph[node_c].append(successor_c)
    return topological_sort(component_graph)


if __name__ == '__main__':
    print(robust_topological_sort({
        0: [1],
        1: [2],
        2: [1, 3],
        3: [3],
    }))
    # Read the comma-separated links as an undirected graph.
    graph = nx.read_edgelist('input_filename',
                             create_using=None, delimiter=',', nodetype=None, edgetype=None)
    results = strongly_connected_components(graph)
    # Write one component per line, IDs separated by commas.
    with open('output_filename', 'w') as f:
        for item in results:
            f.write(','.join(map(str, item)))
            f.write('\n')
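Because the links here are undirected, the strongly connected components coincide with the ordinary connected components, so the same grouping can also be computed without NetworkX by using a union-find (disjoint-set) structure. The following is a sketch of that swapped-in technique rather than the code above: `group_rows`, `find` and `union` are hypothetical names, and both the comma-separated integer input and the smallest-ID group label are assumptions.

```python
import csv
import io


def group_rows(csv_text):
    """Group linked IDs with union-find; return (colA, colB, group) rows."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        # Walk up to the root of x's set.
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the way directly at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller ID as the root so it doubles as the group label.
            if rb < ra:
                ra, rb = rb, ra
            parent[rb] = ra

    # Assumption: each non-empty row holds exactly two integer IDs.
    edges = [(int(row[0]), int(row[1]))
             for row in csv.reader(io.StringIO(csv_text)) if row]
    for a, b in edges:
        union(a, b)
    return [(a, b, find(a)) for a, b in edges]


sample = "1,2\n1,3\n1,5\n2,6\n3,7\n9,10\n"
for row in group_rows(sample):
    print(*row)
```

Unlike the recursive `visit` function, this version is iterative, so a long chain of links will not run into Python's recursion limit.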