Question

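I have a dataset with two columns in a CSV file. The purpose of the dataset is to provide a link between two different IDs if they belong to the same person. For example, 2, 3 and 5 belong to 1: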

COLA  COLB
1     2
1     3
1     5
2     6
3     7
9     10

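In the example above, 1 is linked to 2, 3 and 5, 2 is linked to 6, and 3 is linked to 7. What I want to achieve is to identify all records that are linked to 1 either directly (2, 3, 5) or indirectly (6, 7), so that I can say these IDs in column B belong to the same person as the ID in column A. I then want to either deduplicate, or add a new column to the output file that is populated with 1 for every row linked to 1.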

Example of the expected output:

colA  colB  GroupField
1     2     1
1     3     1
1     5     1
2     6     1
3     7     1
9     10    9
10    11    9

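How can I solve this problem?

So far I have been able to read the file in and create a dictionary. I have looked into Python set operations, but I could not get them to work with a dictionary.

I have also looked for ways to convert the dictionary into sets and then use set operators to deduplicate between the sets, but I could not find anything online, and I am not sure whether this is even the right approach to the problem.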

Answer 1

Your input is a graph, and many Python libraries can help you analyze one. NetworkX is one of them.

You are looking for the connected components of the graph, and a number of functions in NetworkX can find them.

A few lines of code to get you started:

import networkx as nx

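file_contents = "1 2 ; 1 3 ; 1 5 ; 2 6 ; 3 7 ; 9 10"
# Split the text into one "source target" string per edge.
lines = [item.strip() for item in file_contents.split(";")]
G = nx.parse_edgelist(lines, nodetype=int)
# connected_components yields one set of nodes per component.
components = list(nx.connected_components(G))
# components now holds:
# [{1, 2, 3, 5, 6, 7}, {9, 10}]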
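
To get from the components to the GroupField column in the question, each row needs a label for its group. The sketch below is one way to do that without NetworkX, using a small union-find over the ID pairs. It assumes the label should be the smallest ID in each group (which matches the 1 and 9 in the expected output), and the names find and group_rows are made up for illustration.

```python
def find(parent, x):
    """Return the representative of x's group, compressing the path."""
    root = x
    while parent[root] != root:
        root = parent[root]
    # Point every node on the path directly at the root.
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def group_rows(rows):
    """Append a group label (assumed: smallest ID in the group) to each row."""
    parent = {}
    for a, b in rows:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            # Keep the smaller ID as the root so it doubles as the label.
            parent[max(ra, rb)] = min(ra, rb)
    return [(a, b, find(parent, a)) for a, b in rows]


rows = [(1, 2), (1, 3), (1, 5), (2, 6), (3, 7), (9, 10)]
labelled = group_rows(rows)
for row in labelled:
    print(*row)
```

Each tuple in labelled is (colA, colB, GroupField), so writing it out with the csv module should give the extra column asked for in the question.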

Answer 2

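Logc, thank you for pointing me in the right direction. I was able to solve this using a bit of NetworkX together with Tarjan's algorithm for strongly connected components. Here is my code: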

import networkx as nx
def strongly_connected_components(graph):
    """ Find the strongly connected components in a graph using
    Tarjan's algorithm.

    graph should be a dictionary mapping node names to
    lists of successor nodes.
    """

    result = [ ]
    stack = [ ]
    low = { }

    def visit(node):
        if node in low: return

        num = len(low)
        low[node] = num
        stack_pos = len(stack)
        stack.append(node)

        for successor in graph[node]:
            visit(successor)
            low[node] = min(low[node], low[successor])

        if num == low[node]:
            # node is the root of a component: pop the component off the stack.
            component = tuple(stack[stack_pos:])
            del stack[stack_pos:]
            result.append(component)
            # Mark the members as finished so later visits ignore them.
            for item in component:
                low[item] = len(graph)

    for node in graph:
        visit(node)

    return result


def topological_sort(graph):
    # Count the incoming edges of every node.
    count = { }
    for node in graph:
        count[node] = 0
    for node in graph:
        for successor in graph[node]:
            count[successor] += 1

    # Nodes without incoming edges can be emitted right away.
    ready = [ node for node in graph if count[node] == 0 ]

    result = [ ]
    while ready:
        node = ready.pop(-1)
        result.append(node)

        for successor in graph[node]:
            count[successor] -= 1
            if count[successor] == 0:
                ready.append(successor)

    return result

def robust_topological_sort(graph):
    """ First identify strongly connected components,
    then perform a topological sort on these components. """

    components = strongly_connected_components(graph)

    # Map every node to the component that contains it.
    node_component = { }
    for component in components:
        for node in component:
            node_component[node] = component

    # Build a graph whose nodes are the components themselves.
    component_graph = { }
    for component in components:
        component_graph[component] = [ ]

    for node in graph:
        node_c = node_component[node]
        for successor in graph[node]:
            successor_c = node_component[successor]
            if node_c != successor_c:
                component_graph[node_c].append(successor_c)

    return topological_sort(component_graph)


if __name__ == '__main__':
    print(robust_topological_sort({
        0 : [1],
        1 : [2],
        2 : [1,3],
        3 : [3],
    }))

# Read the comma-separated ID pairs as an undirected graph.
graph = nx.read_edgelist('input_filename', create_using=None,
                         delimiter=',', nodetype=None, edgetype=None)
results = strongly_connected_components(graph)
# Write one group of linked IDs per line.
with open('output_filename', 'w') as f:
    for item in results:
        f.write(','.join(map(str, item)))
        f.write('\n')
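
A side note on the approach: because the links here go both ways, strongly connected components and ordinary connected components are the same thing, so a plain breadth-first search should find the same groups. It also avoids the recursive visit, which can hit Python's recursion limit on long chains of IDs. A minimal sketch, assuming the edges are already available as pairs (connected_groups is a made-up name, not part of NetworkX):

```python
from collections import defaultdict, deque


def connected_groups(edges):
    """Group node IDs that are linked directly or indirectly."""
    # Store each link in both directions, since the links are undirected.
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    groups = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from the first unseen node of a new group.
        seen.add(start)
        queue = deque([start])
        group = []
        while queue:
            node = queue.popleft()
            group.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        groups.append(sorted(group))
    return groups


edges = [(1, 2), (1, 3), (1, 5), (2, 6), (3, 7), (9, 10)]
groups = connected_groups(edges)
print(groups)
```

The groups come back in the order their first ID appears in the input, with the IDs inside each group sorted.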

Traversing a parent-child dataset with Python
