Traversing a parent-child dataset with Python

Time: 2014-05-28 01:15:20

Tags: python recursion csv

I have a two-column dataset in a CSV file. Its purpose is to link two different IDs when they belong to the same person. For example, 2, 3 and 5 belong to 1:

COLA COLB
1 2 ; 1 3 ; 1 5 ; 2 6 ; 3 7 ; 9 10

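In the example above, 1 is linked to 2, 3 and 5; 2 is linked to 6; and 3 is linked to 7. What I want is to identify every record that is linked to 1, either directly (2, 3, 5) or indirectly (6, 7), so that I can say these IDs in column B belong to the same person as the ID in column A. I then want to either deduplicate, or add a new column to the output file that is filled with 1 for every row linked to 1.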

Example of expected output:

colA colB GroupField
1 2 1 ; 1 3 1 ; 1 5 1 ; 2 6 1 ; 3 7 1 ; 9 10 9 ; 10 11 9

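How can I solve this?

So far I have been able to read the file in and build a dictionary. I have looked into Python set operations, but I could not get them to work with a dictionary.

I have also looked into converting the dictionary to sets and then using set operators to deduplicate across the sets, but I could not find anything online, and I am not sure this is even the right approach to the problem.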

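2 Answers:

Answer 0: (score: 0)

Your input is a graph, and many Python libraries can help you analyze one. NetworkX is one of them.

You are looking for the connected components of the graph, and a number of functions in NetworkX can find them.

A few lines of code to get you started: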

import networkx as nx

file_contents = "1 2 ; 1 3 ; 1 5 ; 2 6 ; 3 7 ; 9 10"
lines = [item.strip() for item in file_contents.split(";")]
G = nx.parse_edgelist(lines, nodetype=int)
# connected_components yields one set of nodes per component
components = [sorted(c) for c in nx.connected_components(G)]
# components now holds:
# [[1, 2, 3, 5, 6, 7], [9, 10]]
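
Going from those components to the GroupField column the question asks for is a small extra step. The following is a minimal sketch, assuming the smallest ID in each component serves as the group label (which is consistent with the expected output in the question). `label_rows` is a hypothetical helper name, not part of NetworkX, and the component lists are copied from the comment above so that the sketch needs only the standard library:

```python
import csv
import io


def label_rows(edges, components):
    """Return (colA, colB, group) rows; group is the smallest ID in the component."""
    group_of = {}
    for component in components:
        label = min(component)  # assumption: smallest ID names the group
        for node in component:
            group_of[node] = label
    return [(a, b, group_of[a]) for a, b in edges]


edges = [(1, 2), (1, 3), (1, 5), (2, 6), (3, 7), (9, 10)]
components = [[1, 2, 3, 5, 6, 7], [9, 10]]
rows = label_rows(edges, components)

# Write the result as CSV text; swap io.StringIO for an open file to save it.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["colA", "colB", "GroupField"])
writer.writerows(rows)
print(out.getvalue())
```

Both ends of a row always sit in the same component, so looking up the group of the column A value is enough.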

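Answer 1: (score: 0)

Logc, thanks for pointing me in the right direction. I was able to solve this using a bit of NetworkX together with Tarjan's algorithm for strongly connected components. Here is my code: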

import networkx as nx


def strongly_connected_components(graph):
    """ Find the strongly connected components in a graph using
    Tarjan's algorithm.

    graph should be a dictionary mapping node names to
    lists of successor nodes.
    """

    result = []
    stack = []
    low = {}

    def visit(node):
        # Skip nodes that have already been numbered.
        if node in low:
            return

        num = len(low)
        low[node] = num
        stack_pos = len(stack)
        stack.append(node)

        for successor in graph[node]:
            visit(successor)
            low[node] = min(low[node], low[successor])

        # node is the root of a component: pop the component off the stack.
        if num == low[node]:
            component = tuple(stack[stack_pos:])
            del stack[stack_pos:]
            result.append(component)
            for item in component:
                low[item] = len(graph)

    for node in graph:
        visit(node)

    return result


def topological_sort(graph):
    # Count the incoming edges of every node.
    count = {}
    for node in graph:
        count[node] = 0
    for node in graph:
        for successor in graph[node]:
            count[successor] += 1

    # Nodes without incoming edges can be emitted right away.
    ready = [node for node in graph if count[node] == 0]

    result = []
    while ready:
        node = ready.pop(-1)
        result.append(node)

        for successor in graph[node]:
            count[successor] -= 1
            if count[successor] == 0:
                ready.append(successor)

    return result

def robust_topological_sort(graph):
    """ First identify strongly connected components,
    then perform a topological sort on these components. """

    components = strongly_connected_components(graph)

    # Map every node to the component that contains it.
    node_component = {}
    for component in components:
        for node in component:
            node_component[node] = component

    # Build a graph whose nodes are the components themselves.
    component_graph = {}
    for component in components:
        component_graph[component] = []

    for node in graph:
        node_c = node_component[node]
        for successor in graph[node]:
            successor_c = node_component[successor]
            if node_c != successor_c:
                component_graph[node_c].append(successor_c)

    return topological_sort(component_graph)


if __name__ == '__main__':
    # Small self-check on a hand-built graph.
    print(robust_topological_sort({
        0: [1],
        1: [2],
        2: [1, 3],
        3: [3],
    }))

    # Each line of the input file is one comma-separated pair of IDs.
    graph = nx.read_edgelist('input_filename', delimiter=',')
    results = strongly_connected_components(graph)
    # Write one group per line, IDs separated by commas.
    with open('output_filename', 'w') as f:
        for item in results:
            f.write(','.join(map(str, item)))
            f.write('\n')
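
Because the links in the question are undirected, the groups are plain connected components, so a breadth-first search over a dictionary of sets gives the same grouping without NetworkX or Tarjan's algorithm. This is only a sketch of that swapped-in technique, not the code used above. `connected_groups` is a hypothetical name, and the edge list is hard-coded from the question rather than read from a file:

```python
from collections import defaultdict, deque


def connected_groups(edges):
    """Group IDs that are linked directly or indirectly (undirected links)."""
    # Adjacency as a dict of sets: each ID maps to the IDs it is linked to.
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    groups = []
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search from an ID that has no group yet.
        seen.add(start)
        queue = deque([start])
        group = set()
        while queue:
            node = queue.popleft()
            group.add(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        groups.append(group)
    return groups


edges = [(1, 2), (1, 3), (1, 5), (2, 6), (3, 7), (9, 10)]
groups = connected_groups(edges)
print([sorted(group) for group in groups])
```

The search is iterative, so it does not depend on Python's recursion limit, which a recursive `visit` can hit on long chains of links.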