Question

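I have written a program in Python that spends a large amount of time looking up attributes of objects and values from dictionary keys. I would like to know whether there is any way to optimize these lookup times, possibly with a C extension, to reduce execution time, or whether I simply need to reimplement the program in a compiled language.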

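The program implements some algorithms using a graph. It runs prohibitively slowly on our data sets, so I profiled the code with cProfile using a reduced data set that could actually complete. The vast majority of the time is being burned in one function, and specifically in two statements within that function, both generator expressions: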

The generator expression at line 202 is

    neighbors_in_selected_nodes = (neighbor for neighbor in
            node_neighbors if neighbor in selected_nodes)

and the generator expression at line 204 is

    neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
            neighbor in neighbors_in_selected_nodes)

The source code of the function is provided below for context.

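selected_nodes is a set of nodes in interaction_graph, which is a NetworkX Graph instance. node_neighbors is an iterator from Graph.neighbors_iter().

Graph itself uses dictionaries to store nodes and edges. Its Graph.node attribute is a dictionary that maps each node to a per-node dictionary holding that node's attributes (e.g., 'weight').

Each of these lookups ought to be amortized constant time (i.e., O(1)); however, I am still paying a large penalty for them. Is there some way I can speed up these lookups (e.g., by writing parts of this as a C extension), or do I need to move the program to a compiled language?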

Below is the full source code of the function, for context; the vast majority of execution time is spent within this function.

def calculate_node_z_prime(
        node,
        interaction_graph,
        selected_nodes
    ):
    """Calculates a z'-score for a given node.

    The z'-score is based on the z-scores (weights) of the neighbors of
    the given node, and proportional to the z-score (weight) of the
    given node. Specifically, we find the maximum z-score of all
    neighbors of the given node that are also members of the given set
    of selected nodes, multiply this z-score by the z-score of the given
    node, and return this value as the z'-score for the given node.

    If the given node has no neighbors in the interaction graph, the
    z'-score is defined as zero.

    Returns the z'-score as zero or a positive floating point value.

    :Parameters:
    - `node`: the node for which to compute the z-prime score
    - `interaction_graph`: graph containing the gene-gene or gene
      product-gene product interactions
    - `selected_nodes`: a `set` of nodes fitting some criterion of
      interest (e.g., annotated with a term of interest)

    """
    node_neighbors = interaction_graph.neighbors_iter(node)
    neighbors_in_selected_nodes = (neighbor for neighbor in
            node_neighbors if neighbor in selected_nodes)
    neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
            neighbor in neighbors_in_selected_nodes)
    try:
        max_z_score = max(neighbor_z_scores)
    # max() throws a ValueError if its argument has no elements; in this
    # case, we need to set the max_z_score to zero
    except ValueError, e:
        # Check to make certain max() raised this error
        if 'max()' in e.args[0]:
            max_z_score = 0
        else:
            raise e

    z_prime = interaction_graph.node[node]['weight'] * max_z_score
    return z_prime

Here are the top few calls according to cProfile, sorted by time.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
156067701  352.313    0.000  642.072    0.000 bpln_contextual.py:204(<genexpr>)
156067701  289.759    0.000  289.759    0.000 bpln_contextual.py:202(<genexpr>)
 13963893  174.047    0.000  816.119    0.000 {max}
 13963885   69.804    0.000  936.754    0.000 bpln_contextual.py:171(calculate_node_z_prime)
  7116883   61.982    0.000   61.982    0.000 {method 'update' of 'set' objects}
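For readers who want to produce a listing like the one above, here is a minimal sketch using the standard cProfile and pstats modules. The function `work` is a hypothetical stand-in for the real workload, and the sketch is written for Python 3 (it uses `io.StringIO`); it is not the exact invocation used in the question.

```python
import cProfile
import io
import pstats


def work():
    """Hypothetical stand-in for the code being profiled."""
    return sum(i * i for i in range(10000))


profiler = cProfile.Profile()
profiler.enable()
result = work()
profiler.disable()

# Collect the report in a string, sorted by internal time (the tottime column).
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats('time').print_stats(5)  # show only the top 5 entries
report = stream.getvalue()
print(report)
```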

Answer 1

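How about keeping the iteration order of interaction_graph.neighbors_iter(node) sorted (or partially sorted, using the heapq module)? Since you are only trying to find the maximum value, you can iterate over node_neighbors in descending order of weight; the first node that is in selected_nodes must then be the maximum among the selected neighbors.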
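A minimal sketch of that idea, assuming plain dictionaries stand in for the NetworkX graph. The names `weights`, `adjacency`, `sorted_neighbors`, and `max_selected_neighbor_weight` are hypothetical and not part of the original program.

```python
# Plain dicts stand in for the graph (an assumption of this sketch).
weights = {'a': 1.5, 'b': 3.0, 'c': 2.0, 'd': 0.5}
adjacency = {'a': ['b', 'c', 'd']}

# One-time precomputation: each node's neighbors, highest weight first.
sorted_neighbors = {
    node: sorted(neighbors, key=weights.__getitem__, reverse=True)
    for node, neighbors in adjacency.items()
}


def max_selected_neighbor_weight(node, selected_nodes):
    """Return the largest weight among selected neighbors, or 0 if none."""
    for neighbor in sorted_neighbors[node]:
        if neighbor in selected_nodes:
            # Neighbors are in descending order, so the first hit is the max.
            return weights[neighbor]
    return 0


print(max_selected_neighbor_weight('a', {'c', 'd'}))
```

The loop stops at the first selected neighbor, so the saving depends on how early a hit typically occurs; that is something to measure on the real data.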

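Second, how often does selected_nodes change? If it changes rarely, you can save a lot of iterations by keeping a precomputed collection of "interaction_graph.node[x] for x in selected_nodes" instead of rebuilding it on every call.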
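One way this caching could look, as a sketch: build a dictionary from each selected node to its weight once, and rebuild it only when selected_nodes changes. The names `node_attrs`, `selected_weights`, and `max_cached_weight` are hypothetical, and `node_attrs` is a plain dict standing in for interaction_graph.node.

```python
# Stand-in for interaction_graph.node (an assumption of this sketch).
node_attrs = {
    'a': {'weight': 1.5},
    'b': {'weight': 3.0},
    'c': {'weight': 2.0},
    'd': {'weight': 0.5},
}
selected_nodes = {'b', 'd'}

# Rebuild this only when selected_nodes changes.
selected_weights = {n: node_attrs[n]['weight'] for n in selected_nodes}


def max_cached_weight(neighbors, selected_weights):
    """Return the largest cached weight among the neighbors, or 0 if none."""
    get = selected_weights.get  # bind the method once, outside the loop
    best = None
    for neighbor in neighbors:
        weight = get(neighbor)  # one lookup replaces the membership test plus two dict lookups
        if weight is not None and (best is None or weight > best):
            best = weight
    return 0 if best is None else best


print(max_cached_weight(['b', 'c', 'd'], selected_weights))
```

Whether this pays off depends on how often the cache must be rebuilt relative to how often it is read.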

EDIT: in reply to the comments

"A sort() would take O(n log n)"

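Not necessarily; you are relying too much on your textbook. Despite what the textbook says, you can sometimes beat the O(n log n) barrier by exploiting structure in your data. If you keep your list of neighbors in a naturally sorted data structure in the first place (e.g., a heapq-managed heap or a binary tree), you do not need to re-sort on every iteration. Of course, this is a space-time tradeoff: you need to store redundant lists of neighbors, and there is extra code complexity to make sure those lists are updated when the neighbors change.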
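As a rough sketch of such a structure, the standard bisect module can keep a list ordered as neighbors are added or removed. The class `SortedNeighbors` is hypothetical; note that `bisect.insort` finds the position in O(log n) but the list insertion itself is O(n), and that nodes must be mutually comparable to break ties between equal weights.

```python
import bisect


class SortedNeighbors(object):
    """Keeps (weight, node) pairs in ascending order as neighbors change."""

    def __init__(self):
        self._items = []

    def add(self, node, weight):
        # Insert at the position that keeps the list sorted.
        bisect.insort(self._items, (weight, node))

    def remove(self, node, weight):
        # Locate the pair by binary search and delete it if present.
        index = bisect.bisect_left(self._items, (weight, node))
        if index < len(self._items) and self._items[index] == (weight, node):
            del self._items[index]

    def descending(self):
        """Yield (node, weight) pairs from highest to lowest weight."""
        for weight, node in reversed(self._items):
            yield node, weight


neighbors = SortedNeighbors()
neighbors.add('b', 3.0)
neighbors.add('c', 2.0)
neighbors.add('d', 0.5)
neighbors.remove('c', 2.0)

selected_nodes = {'d'}
# The first selected neighbor in descending order carries the maximum weight.
best = next((w for n, w in neighbors.descending() if n in selected_nodes), 0)
print(best)
```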

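Also, Python's list.sort(), which uses the timsort algorithm, is very fast on nearly sorted data (it can approach O(n) in such cases). It still does not break the O(n log n) bound in the general case; that has been proven mathematically impossible for comparison sorts.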

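You need to profile before dismissing a solution as unlikely to improve performance. When doing extreme optimization, you may well find that in certain very special edge cases a slow old bubble sort beats a glorified quicksort or merge sort.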

Answer 2

I don't see why your "weight" lookups must be in the form ["weight"] (nodes are dictionaries?) instead of .weight (nodes are objects).

If your nodes are objects and don't have many fields, you can take advantage of the __slots__ directive to optimize their storage:

class Node(object):
    # ... class stuff goes here ...

    __slots__ = ('weight',) # tuple of member names.

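EDIT: So I looked at the NetworkX link you provided, and several things about it bother me. First, right at the top, the definition of "dictionary" is "FIXME".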

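Overall, it seems to insist on using dictionaries, rather than classes that can be subclassed, to store attributes. While attribute lookup on an object may essentially be a dictionary lookup, I don't see how working with an object could be worse. If anything, it could be better, since object attribute lookup is more likely to be optimized, because:

- object attribute lookups are so common,
- the keyspace for object attributes is far more restricted than for dictionary keys, so an optimized comparison algorithm can be used in the search, and
- objects have the __slots__ optimization for exactly these cases, where an object has only a few fields and you need optimized access to them.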

For example, I frequently use __slots__ on classes that represent coordinates. A tree node seems to me another obvious use.

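So that's why, when I read:

Nodes
A node can be any hashable Python object except None.

I think, okay, no problem, but then immediately following is

Node attributes
Nodes can have arbitrary Python objects assigned as attributes by using keyword/value pairs when adding a node or assigning to the G.node[n] attribute dictionary for the specified node n.

I think, if a node needs attributes, why would they be stored separately? Why not just put them in the node? Is writing a class with contentString and weight members detrimental? Edges seem even crazier, since they are dictated to be tuples rather than objects you could subclass.

So I find the design decisions behind NetworkX rather unfortunate.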

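If you are stuck with it, I'd recommend moving the attributes from those dictionaries into the actual nodes or, if that is not an option, using integers rather than strings as keys into the attribute dictionary, so that lookups use a much faster comparison.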
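A small sketch of what moving the attributes into the nodes might look like. The `Node` class with `name` and `weight` fields is hypothetical; NetworkX accepts any hashable object as a node, and instances of this class are hashable by identity by default. Whether this is actually faster than the dictionary lookups is something to measure.

```python
class Node(object):
    """Hypothetical node type that carries its own attributes."""

    __slots__ = ('name', 'weight')  # no per-instance __dict__ is created

    def __init__(self, name, weight):
        self.name = name
        self.weight = weight


nodes = [Node('a', 1.5), Node('b', 3.0), Node('c', 2.0)]
selected_nodes = set(nodes[1:])  # hashed by identity

# The weight is read straight off the node; no attribute dictionary is involved.
max_weight = max(n.weight for n in nodes if n in selected_nodes)
print(max_weight)
```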

Finally, what if you combined your generators?

neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
        neighbor in node_neighbors if neighbor in selected_nodes)
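Put into the context of the whole function, the merged generator might look like the sketch below. It assumes Python 3.4 or later, where max() accepts a `default` argument, which removes the need for the try/except around an empty sequence. The parameters `neighbors` and `node_attrs` are hypothetical stand-ins for the neighbor iterator and interaction_graph.node.

```python
def calculate_node_z_prime(node, neighbors, node_attrs, selected_nodes):
    """Sketch of the z'-score computation with a single generator."""
    neighbor_z_scores = (
        node_attrs[neighbor]['weight']
        for neighbor in neighbors
        if neighbor in selected_nodes
    )
    # default=0 covers the case of no selected neighbors (Python 3.4+).
    max_z_score = max(neighbor_z_scores, default=0)
    return node_attrs[node]['weight'] * max_z_score


node_attrs = {
    'a': {'weight': 1.5},
    'b': {'weight': 3.0},
    'c': {'weight': 2.0},
    'd': {'weight': 0.5},
}
print(calculate_node_z_prime('a', ['b', 'c', 'd'], node_attrs, {'c', 'd'}))
```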

Answer 3

Try accessing the dict directly and catching KeyError; it might be faster, depending on your hit/miss ratio:

# cache this object
ignode = interaction_graph.node
neighbor_z_scores = []
for neighbor in node_neighbors:
    try:
        neighbor_z_scores.append(ignode[neighbor]['weight'])
    except KeyError:
        pass

Or use dict.get with a default value and generator expressions:

sentinel = object()
# cache this object
ignode = interaction_graph.node

# look up 'weight' with a sentinel default so a missing weight does not raise
neighbor_z_scores = (ignode[neighbor].get('weight', sentinel) for neighbor in node_neighbors)
# using identity testing, it's slightly faster
neighbor_z_scores = (score for score in neighbor_z_scores if score is not sentinel)
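Since the outcome depends on the hit/miss ratio, a small timeit comparison is one way to check it on your own data. The sketch below assumes a hypothetical `selected_weights` dictionary holding weights for the selected nodes only, with a 10% hit rate chosen purely for illustration. Raising and catching an exception is typically much more expensive than a failed membership test, so frequent misses tend to favor the explicit check.

```python
import timeit

# Hypothetical data: 1000 neighbors, of which every tenth is "selected".
node_attrs = {i: {'weight': float(i)} for i in range(1000)}
selected_weights = {i: node_attrs[i]['weight'] for i in range(0, 1000, 10)}
neighbors = list(range(1000))


def with_try_except():
    """Look up directly and swallow KeyError on a miss."""
    scores = []
    for neighbor in neighbors:
        try:
            scores.append(selected_weights[neighbor])
        except KeyError:
            pass
    return scores


def with_membership_test():
    """Check membership first, then look up."""
    return [selected_weights[n] for n in neighbors if n in selected_weights]


for func in (with_try_except, with_membership_test):
    seconds = timeit.timeit(func, number=200)
    print('%-22s %.4f s' % (func.__name__, seconds))
```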

Answer 4

Without digging too deep into your code, try adding a little speed with itertools.

Add these at the module level:

import itertools as it, operator as op
# the per-node attributes are dicts, so use itemgetter (attrgetter would fail on a dict)
GET_WEIGHT = op.itemgetter('weight')

Change:

neighbors_in_selected_nodes = (neighbor for neighbor in
        node_neighbors if neighbor in selected_nodes)

into:

neighbors_in_selected_nodes = it.ifilter(selected_nodes.__contains__, node_neighbors)

and

neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
        neighbor in neighbors_in_selected_nodes)

into:

neighbor_z_scores = (
    it.imap(
        GET_WEIGHT,
        it.imap(
            interaction_graph.node.__getitem__,
            neighbors_in_selected_nodes)
    )
)

Does any of this help?
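On Python 3, itertools.ifilter and itertools.imap no longer exist; the built-in filter and map return lazy iterators instead. A sketch of the same pipeline for Python 3 follows, with the plain dict `node_attrs` standing in for interaction_graph.node (an assumption of this example).

```python
import operator as op

GET_WEIGHT = op.itemgetter('weight')

# Stand-ins for the graph data (hypothetical values).
node_attrs = {
    'a': {'weight': 1.5},
    'b': {'weight': 3.0},
    'c': {'weight': 2.0},
    'd': {'weight': 0.5},
}
selected_nodes = {'c', 'd'}
node_neighbors = iter(['b', 'c', 'd'])

# filter() and map() are lazy in Python 3, like ifilter/imap in Python 2.
neighbors_in_selected_nodes = filter(selected_nodes.__contains__, node_neighbors)
neighbor_z_scores = map(
    GET_WEIGHT,
    map(node_attrs.__getitem__, neighbors_in_selected_nodes),
)

max_z_score = max(neighbor_z_scores)
print(max_z_score)
```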

Optimizing Python code with many attribute and dictionary lookups
