Question

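I am trying to run a computation on a large network object to predict which links will appear between nodes. I can do this serially, but not in parallel using Python's multiprocessing. The parallel implementation never seems to return, and judging by my task manager it does not use much memory or CPU.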

import multiprocessing as mp
import networkx as nx

def jaccard_serial_predictions(G):
    """
    Create a ranked list of possible new links based on the Jaccard similarity,
    defined as the intersection of nodes divided by the union of nodes

    parameters
    G: Directed or undirected nx graph
    returns
    list of linkbunches with the score as an attribute
    """
    potential_edges = []
    G_undirected = nx.Graph(G)
    for non_edge in nx.non_edges(G_undirected):
        u = set(G.neighbors(non_edge[0]))
        v = set(G.neighbors(non_edge[1]))
        uv_un = len(u.union(v))
        uv_int = len(u.intersection(v))
        if uv_int == 0 or uv_un == 0:
            continue
        else:
            s = (1.0*uv_int)/uv_un

        potential_edges.append(non_edge + ({'score': s},))

    return potential_edges

def jaccard_prediction(non_edge):
    u = set(G.neighbors(non_edge[0]))
    v = set(G.neighbors(non_edge[1]))
    uv_un = len(u.union(v))
    uv_int = len(u.intersection(v))
    if uv_int == 0 or uv_un == 0:
        return
    else:
        s = (1.0*uv_int)/uv_un
    return non_edge + ({'score': s},)

def jaccard_mp_predictions(G):
    """
    Create a ranked list of possible new links based on the Jaccard similarity,
    defined as the intersection of nodes divided by the union of nodes

    parameters
    G: Directed or undirected nx graph
    returns
    list of linkbunches with the score as an attribute
    """
    pool = mp.Pool(processes=4)
    G_undirected = nx.Graph(G)
    results = pool.map(jaccard_prediction, nx.non_edges(G_undirected))
    return results

Calling jaccard_serial_predictions(G), where G is a graph with 95000000 potential edges, returns in 4.5 minutes, but jaccard_mp_predictions(G) does not return even after running for half an hour.

Answer 1

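I'm not certain about this, but I think I've spotted a potential slowdown. Compare the code that runs for each candidate edge in the serial version: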

u = set(G.neighbors(non_edge[0]))
v = set(G.neighbors(non_edge[1]))
uv_un = len(u.union(v))
uv_int = len(u.intersection(v))
if uv_int == 0 or uv_un == 0:
    continue
else:
    s = (1.0*uv_int)/uv_un
potential_edges.append(non_edge + ({'score': s},))

with the code for the parallel version:

u = set(G.neighbors(non_edge[0]))
v = set(G.neighbors(non_edge[1]))
uv_un = len(u.union(v))
uv_int = len(u.intersection(v))
if uv_int == 0 or uv_un == 0:
    return
else:
    s = (1.0*uv_int)/uv_un
return non_edge + ({'score': s},)

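In the serial version, whenever the condition uv_int == 0 or uv_un == 0 is true, you skip adding anything to the list. In the parallelized version, you return None instead.

The map operation is not smart enough to leave None out of the list, whereas the serial operation simply skips those elements. The parallel version therefore does an extra append for every element that cannot be scored, and that may be what slows it down. If you have a lot of them, the slowdown could be huge!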
