Question

I am trying to generate large scale-free graphs using the networkx package in Python 3. These are my software versions:

python --version
Python 3.7.3

pip --version
pip 19.0.3 from /home/user/bin/python3/lib/python3.7/site-packages/pip (python 3.7)

pip show networkx
Name: networkx
Version: 2.3
Summary: Python package for creating and manipulating graphs and networks

More specifically, I need to generate scale-free graphs with 100K, 1M and 10M vertices. My code is very short:

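import datetime
import os

import networkx as nx

# `args` is the script's parsed command-line namespace (parser setup not shown).
n = 1000000  # 100K, then 1M, then 10M...
G = nx.scale_free_graph(n)

# Timestamp such as 2019-05-01_12-30-00, safe to use in a file name.
file_stamp = str(datetime.datetime.now()).split('.')[0].replace(' ', '_').replace(':', '-')
# Use the same n that was passed to the generator so the name matches the graph.
target_file_name = str(n) + "V_" + str(G.number_of_edges()) + "E_" + file_stamp + ".tsv"
target_file_path = os.path.join(args.out_dir, target_file_name)

print("> Target edge file:\t\t{}".format(target_file_path))

with open(target_file_path, 'wb') as f:
    nx.write_edgelist(G, f, data=False)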

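For n = 100000 (one hundred thousand), execution takes a few seconds. However, for n = 1000000 (one million) or n = 10000000 (ten million), the script has been running for several days. I have noticed that memory usage keeps growing, but very slowly.

I would expect these graphs to take up more memory than the process currently holds, which suggests that the generator logic is the culprit. Given how much time has passed, I am starting to think the generation process itself is slow.

I went to check the source of the networkx.scale_free_graph function: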

@py_random_state(7)
def scale_free_graph(n, alpha=0.41, beta=0.54, gamma=0.05, delta_in=0.2,
                     delta_out=0, create_using=None, seed=None):
    """Returns a scale-free directed graph.
    Parameters
    ----------
    n : integer
        Number of nodes in graph
    alpha : float
        Probability for adding a new node connected to an existing node
        chosen randomly according to the in-degree distribution.
    beta : float
        Probability for adding an edge between two existing nodes.
        One existing node is chosen randomly according the in-degree
        distribution and the other chosen randomly according to the out-degree
        distribution.
    gamma : float
        Probability for adding a new node connected to an existing node
        chosen randomly according to the out-degree distribution.
    delta_in : float
        Bias for choosing nodes from in-degree distribution.
    delta_out : float
        Bias for choosing nodes from out-degree distribution.
    create_using : NetworkX graph constructor, optional
        The default is a MultiDiGraph 3-cycle.
        If a graph instance, use it without clearing first.
        If a graph constructor, call it to construct an empty graph.
    seed : integer, random_state, or None (default)
        Indicator of random number generation state.
        See :ref:`Randomness<randomness>`.
    Examples
    --------
    Create a scale-free graph on one hundred nodes::
    >>> G = nx.scale_free_graph(100)
    Notes
    -----
    The sum of `alpha`, `beta`, and `gamma` must be 1.
    References
    ----------
    .. [1] B. Bollobás, C. Borgs, J. Chayes, and O. Riordan,
           Directed scale-free graphs,
           Proceedings of the fourteenth annual ACM-SIAM Symposium on
           Discrete Algorithms, 132--139, 2003.
    """


    def _choose_node(G, distribution, delta, psum):
        cumsum = 0.0
        # normalization
        r = seed.random()
        for n, d in distribution:
            cumsum += (d + delta) / psum
            if r < cumsum:
                break
        return n

    if create_using is None or not hasattr(create_using, '_adj'):
        # start with 3-cycle
        G = nx.empty_graph(3, create_using, default=nx.MultiDiGraph)
        G.add_edges_from([(0, 1), (1, 2), (2, 0)])
    else:
        G = create_using
    if not (G.is_directed() and G.is_multigraph()):
        raise nx.NetworkXError("MultiDiGraph required in create_using")

    if alpha <= 0:
        raise ValueError('alpha must be > 0.')
    if beta <= 0:
        raise ValueError('beta must be > 0.')
    if gamma <= 0:
        raise ValueError('gamma must be > 0.')

    if abs(alpha + beta + gamma - 1.0) >= 1e-9:
        raise ValueError('alpha+beta+gamma must equal 1.')

    number_of_edges = G.number_of_edges()
    while len(G) < n:
        psum_in = number_of_edges + delta_in * len(G)
        psum_out = number_of_edges + delta_out * len(G)
        r = seed.random()
        # random choice in alpha,beta,gamma ranges
        if r < alpha:
            # alpha
            # add new node v
            v = len(G)
            # choose w according to in-degree and delta_in
            w = _choose_node(G, G.in_degree(), delta_in, psum_in)
        elif r < alpha + beta:
            # beta
            # choose v according to out-degree and delta_out
            v = _choose_node(G, G.out_degree(), delta_out, psum_out)
            # choose w according to in-degree and delta_in
            w = _choose_node(G, G.in_degree(), delta_in, psum_in)
        else:
            # gamma
            # choose v according to out-degree and delta_out
            v = _choose_node(G, G.out_degree(), delta_out, psum_out)
            # add new node w
            w = len(G)
        G.add_edge(v, w)
        number_of_edges += 1
    return G

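The main loop of this code runs a number of times proportional to the number of vertices n (it stops once the graph has n nodes).

Without analysing it further: inside the main loop, _choose_node is called at least once and at most twice per iteration. Inside that function there is another loop, which iterates over the in-degrees or out-degrees of the nodes (the distribution argument).

I assume that as n increases, the computation time spent inside _choose_node increases as well.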
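One way to check that assumption without waiting for days is to count how many entries the inner scans visit. The following is only a minimal sketch: it replays the same loop on plain Python lists instead of a networkx graph, the helper name count_scan_steps is made up for illustration, and it assumes nodes are scanned in the order they were created.

```python
import random


def count_scan_steps(n, alpha=0.41, beta=0.54, delta_in=0.2, delta_out=0.0, seed=0):
    """Replay the generator's main loop on plain degree lists and return the
    total number of entries visited by the linear scans (hypothetical helper)."""
    rng = random.Random(seed)
    in_deg = [1, 1, 1]   # degrees of the initial 3-cycle 0 -> 1 -> 2 -> 0
    out_deg = [1, 1, 1]
    num_edges = 3
    steps = 0

    def choose(degrees, delta, psum):
        # Same cumulative scan as _choose_node, with a step counter added.
        nonlocal steps
        r = rng.random()
        cumsum = 0.0
        node = 0
        for node, d in enumerate(degrees):
            steps += 1
            cumsum += (d + delta) / psum
            if r < cumsum:
                break
        return node

    while len(in_deg) < n:
        num_nodes = len(in_deg)
        psum_in = num_edges + delta_in * num_nodes
        psum_out = num_edges + delta_out * num_nodes
        r = rng.random()
        if r < alpha:                 # new node v -> existing node w
            v, w = num_nodes, choose(in_deg, delta_in, psum_in)
        elif r < alpha + beta:        # existing v -> existing w
            v = choose(out_deg, delta_out, psum_out)
            w = choose(in_deg, delta_in, psum_in)
        else:                         # existing v -> new node w
            v, w = choose(out_deg, delta_out, psum_out), num_nodes
        if num_nodes in (v, w):       # a new node was created in this step
            in_deg.append(0)
            out_deg.append(0)
        out_deg[v] += 1
        in_deg[w] += 1
        num_edges += 1
    return steps


if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        print(n, count_scan_steps(n))
```

If the step count grows clearly faster than n itself each time n doubles, the inner scan is the likely bottleneck rather than memory or file output.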

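Is there a faster implementation of this scale-free generator available in networkx? Or is there a function in another library (no language restrictions) that generates scale-free graphs with the same semantics as this one?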

Answer 1

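There may be ways to do this more efficiently; however, you are dealing with combinatorial growth, which is super-exponential. https://medium.com/@TorBair/exponential-growth-isn-t-cool-combinatorial-growth-is-85a0b1fdb6a5

The challenge is that computing over (n) edges in this manner grows faster than exponentially. You could use a more efficient algorithm, but it will not get you much further, because you are up against the raw mathematics of the problem.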
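As an illustration of what a more efficient algorithm could look like here, the sketch below replaces the linear scan with a constant-time draw. It keeps one list entry per edge endpoint, so picking a uniformly random entry selects a node in proportion to its degree, and the delta bias is handled by occasionally picking a uniformly random node instead. The function name fast_scale_free_edges is hypothetical and not part of networkx. It is intended to sample each step with the same probabilities as _choose_node, but it consumes random numbers differently, so a given seed will not reproduce the graph networkx would build. How much it helps in practice would need to be measured.

```python
import random


def fast_scale_free_edges(n, alpha=0.41, beta=0.54, gamma=0.05,
                          delta_in=0.2, delta_out=0.0, seed=None):
    """Return a list of directed edges (v, w) on nodes 0..n-1 (hypothetical sketch)."""
    if abs(alpha + beta + gamma - 1.0) >= 1e-9:
        raise ValueError('alpha+beta+gamma must equal 1.')
    rng = random.Random(seed)
    # Start from the same 3-cycle as networkx: 0 -> 1 -> 2 -> 0.
    sources = [0, 1, 2]   # node i appears out_degree(i) times
    targets = [1, 2, 0]   # node i appears in_degree(i) times
    num_nodes = 3

    def choose(endpoints, delta):
        # Target distribution: P(i) = (deg_i + delta) / (E + delta * N).
        # With probability E / (E + delta * N) pick a random edge endpoint
        # (proportional to degree); otherwise pick a uniformly random node.
        num_edges = len(endpoints)
        if rng.random() * (num_edges + delta * num_nodes) < num_edges:
            return endpoints[rng.randrange(num_edges)]
        return rng.randrange(num_nodes)

    while num_nodes < n:
        r = rng.random()
        if r < alpha:
            # New node v pointing at an existing node chosen by in-degree.
            w = choose(targets, delta_in)
            v = num_nodes
            num_nodes += 1
        elif r < alpha + beta:
            # Edge between two existing nodes.
            v = choose(sources, delta_out)
            w = choose(targets, delta_in)
        else:
            # Existing node chosen by out-degree pointing at a new node w.
            v = choose(sources, delta_out)
            w = num_nodes
            num_nodes += 1
        sources.append(v)
        targets.append(w)
    return list(zip(sources, targets))
```

The resulting edge list can be written straight to the TSV file, or loaded into networkx with nx.MultiDiGraph() followed by add_edges_from(edges) if a graph object is needed.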

Answer 2

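The only way to reduce the time is by editing the generator code:

1. Multiply r by psum once (before the loop) instead of dividing every term added to cumsum by psum. That should save n unnecessary divisions.
2. Replace G.in_degree() (evaluated outside the function, at the call site) with G.in_degree(n) (evaluated inside the function's loop). That way, the loop

  for n, d in distribution:

can become something like

  for n in G:
    d = G.in_degree(n)

This avoids computing the in-degree/out-degree distribution for all nodes in the graph up front. The hope is that the random selection usually stops early, which would give a performance gain.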
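Put together, the two edits could look roughly like the sketch below. The name choose_node_lazy and its parameters are made up for illustration. It takes the node iterable and a per-node degree lookup as arguments so that it can be tried without networkx. As far as I can tell, the degree views in networkx 2.x are already evaluated lazily while iterating, so the gain from the second edit may be small and is worth measuring.

```python
import random


def choose_node_lazy(nodes, degree_of, delta, psum, rng):
    """Pick a node with probability (degree + delta) / psum (hypothetical helper).

    nodes     -- iterable of nodes, for example the graph G itself
    degree_of -- callable returning one node's degree, for example G.in_degree
    rng       -- object with a random() method, for example random.Random()
    """
    # Edit 1: scale the random number once instead of dividing every term.
    threshold = rng.random() * psum
    cumsum = 0.0
    chosen = None
    for node in nodes:
        chosen = node
        # Edit 2: look up the degree only for the nodes actually visited.
        cumsum += degree_of(node) + delta
        if threshold < cumsum:
            break
    return chosen


if __name__ == "__main__":
    # Stand-in for a graph: the 3-cycle, where every node has in-degree 1.
    in_degrees = {0: 1, 1: 1, 2: 1}
    delta_in = 0.2
    psum_in = 3 + delta_in * len(in_degrees)
    rng = random.Random(42)
    print(choose_node_lazy(in_degrees, in_degrees.get, delta_in, psum_in, rng))
```

Inside the generator, the call would then read something like choose_node_lazy(G, G.in_degree, delta_in, psum_in, seed) in place of _choose_node(G, G.in_degree(), delta_in, psum_in).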
