Question

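I'm putting together some basic Python code that takes a dictionary of labels mapped to lists of matrices (the matrices represent categorized images). I'm just trying to subtract the mean image from everything and then rescale the data to a 0-1 range. For some reason this code runs very slowly. Iterating over only 500 48x48 images takes about 10 seconds, which won't scale to the number of images I'm working with. Looking at the cProfile results, most of the time is spent in the _center function.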

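I suspect I'm not taking full advantage of numpy here, and I was wondering whether someone more experienced than me has some tricks to speed this up, or could point out something silly I'm doing. The code is posted below: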

def __init__(self, master_dict, normalization = lambda x: math.exp(x)):
    """
    master_dict should be a dictionary mapping classes to lists of matrices

    example = {
        "cats": [[[]...], [[]...]...],
        "dogs": [[[]...], [[]...]...]
    }

    have to be python lists, not numpy arrays

    normalization represents the 0-1 normalization scheme used. Defaults to simple linear
    """
    normalization = np.vectorize(normalization)
    full_tensor = np.array(reduce(operator.add, master_dict.values()))
    centering = np.sum(np.array(reduce(operator.add, master_dict.values())), axis=0)/len(full_tensor)
    self.data = {key: self._center(np.array(value), centering, normalization) for key,value in master_dict.items()}
    self.normalization = normalization

def _center(self, list_of_arrays, centering_factor, normalization_scheme):
    """
    Centering scheme for arrays
    """
    arrays = list_of_arrays - centering_factor
    normalize = lambda a: (a - np.min(a)) / (np.max(a) - np.min(a))
    return normalization_scheme([normalize(array) for array in arrays])

Also, before you ask: I don't have much control over the input format, but if that really is the limiting factor I could probably work something out.

Answer 1

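Starting from @sethMMorton's changes, I was able to get nearly another factor of two in speed. Most of it comes from vectorizing your normalize function (the one inside _center), so that it can be called on the whole list_of_arrays at once instead of being applied per image in a list comprehension. This also removes an extra conversion from numpy array to list and back.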

def normalize(a):
    a -= a.min(1, keepdims=True).min(2, keepdims=True)
    a /= a.max(1, keepdims=True).max(2, keepdims=True)
    return a

Note that I wouldn't define normalize inside _center; I'd keep it separate, as shown in this answer. Then, in _center, simply call normalize on the entire list_of_arrays:

def _center(self, list_of_arrays, centering_factor, normalization_scheme):
    """
    Centering scheme for arrays
    """
    list_of_arrays -= centering_factor
    return normalization_scheme(normalize(list_of_arrays))
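As a quick sanity check on the vectorized version, here is a minimal sketch comparing it against the per-image loop it replaces. It uses `axis=(1, 2)` as a shorthand for the chained `.min(1).min(2)` calls, and `normalize_loop` is a hypothetical reference helper, not part of the original code. The in-place operators assume a float array; an integer array would fail on the division.

```python
import numpy as np

def normalize(a):
    # a has shape (n_images, H, W); each image is rescaled to [0, 1] in place.
    # keepdims=True keeps shape (n_images, 1, 1) so the result broadcasts
    # back against every pixel of the matching image.
    a -= a.min(axis=(1, 2), keepdims=True)
    a /= a.max(axis=(1, 2), keepdims=True)
    return a

def normalize_loop(a):
    # Reference implementation: one image at a time, like the list comprehension.
    return np.array([(img - img.min()) / (img.max() - img.min()) for img in a])

rng = np.random.default_rng(0)
images = rng.random((5, 4, 4))

expected = normalize_loop(images)
result = normalize(images.copy())  # copy, because normalize modifies its input
same = np.allclose(result, expected)
print(same)
```

Both versions divide by zero for an image whose pixels are all equal, so constant images would need separate handling.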

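In fact, you could call normalize and _center on the entire full_tensor right at the start and never loop at all; the tricky part is splitting the result back up into a dictionary of lists of arrays. I'll keep working on that next :P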
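One possible way to do that split is sketched below. This is an illustration under stated assumptions, not the answerer's eventual solution: the function name `center_and_normalize_all` is hypothetical, and it returns one numpy array per key rather than a Python list. It records how many images each key contributes, processes the stacked tensor once, and then cuts it at the cumulative counts with `np.split`.

```python
import numpy as np

def center_and_normalize_all(master_dict, normalization=np.exp):
    # Hypothetical helper: process every image in one pass, then split per key.
    keys = list(master_dict)
    counts = [len(master_dict[k]) for k in keys]
    full_tensor = np.concatenate(
        [np.asarray(master_dict[k], dtype=float) for k in keys])

    full_tensor -= full_tensor.mean(axis=0)                      # subtract mean image
    full_tensor -= full_tensor.min(axis=(1, 2), keepdims=True)   # per-image min -> 0
    full_tensor /= full_tensor.max(axis=(1, 2), keepdims=True)   # per-image max -> 1
    full_tensor = normalization(full_tensor)

    # Cut points are the running totals of images per key, excluding the last.
    pieces = np.split(full_tensor, np.cumsum(counts)[:-1])
    return dict(zip(keys, pieces))

rng = np.random.default_rng(0)
example = {"cats": rng.random((3, 4, 4)).tolist(),
           "dogs": rng.random((2, 4, 4)).tolist()}
out = center_and_normalize_all(example)
print({k: v.shape for k, v in out.items()})
```

Because the same key order is used for stacking and for splitting, each key gets back exactly its own images.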

As mentioned in my comment, you can replace:

full_tensor = np.array(reduce(operator.add, master_dict.values()))

with

full_tensor = np.concatenate(master_dict.values())

This probably isn't any faster, but it is clearer and is the standard way to do it.
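A small sketch showing that the two expressions build the same array. The code in this thread targets Python 2; the version below is written for Python 3, where `reduce` lives in `functools` and, as far as I know, `np.concatenate` needs the dict values wrapped in `list()` because a dict view is not a sequence. The toy `master_dict` is made up for illustration.

```python
import operator
from functools import reduce

import numpy as np

# Toy input: one 2x2 "image" for cats, two for dogs.
master_dict = {
    "cats": [[[1.0, 2.0], [3.0, 4.0]]],
    "dogs": [[[5.0, 6.0], [7.0, 8.0]], [[9.0, 10.0], [11.0, 12.0]]],
}

# Original approach: concatenate the Python lists pairwise, then convert once.
via_reduce = np.array(reduce(operator.add, master_dict.values()))

# Suggested approach: let numpy join the per-key blocks along axis 0.
via_concat = np.concatenate(list(master_dict.values()))

print(via_reduce.shape, np.array_equal(via_reduce, via_concat))
```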

Finally, here are the timings:

>>> timeit slater_init(example)
1 loops, best of 3: 1.42 s per loop

>>> timeit seth_init(example)
1 loops, best of 3: 489 ms per loop

>>> timeit my_init(example)
1 loops, best of 3: 281 ms per loop

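Below is my full timing code. Note that I replaced self.data = ... with return ... so that I could save and compare the outputs, to make sure all of our versions return the same data :) Of course, you should test your own version as well!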

# Written for Python 2: reduce is a builtin and dicts have iteritems().
import operator
import math
import numpy as np

#example dict has N keys (integers), each value is a list of n random HxW 'arrays', in list form:
test_shape = 10, 2, 4, 4          # small example for testing
timing_shape = 100, 5, 48, 48     # bigger example for timing
N, n, H, W = timing_shape
example = dict(enumerate(np.random.rand(N, n, H, W).tolist()))

def my_init(master_dict, normalization=np.exp):
    full_tensor = np.concatenate(master_dict.values())
    centering = np.mean(full_tensor, 0)
    return {key: my_center(np.array(value), centering, normalization)
                     for key,value in master_dict.iteritems()} #use iteritems here
    #self.normalization = normalization

def my_normalize(a):
    a -= a.min(1, keepdims=True).min(2, keepdims=True)
    a /= a.max(1, keepdims=True).max(2, keepdims=True)
    return a

def my_center(arrays, centering_factor, normalization_scheme):
    """
    Centering scheme for arrays
    """
    arrays -= centering_factor
    return normalization_scheme(my_normalize(arrays))

#### sethMMorton's original improvement ####

def seth_init(master_dict, normalization = np.exp):
    """
    master_dict should be a dictionary mapping classes to lists of matrices

    example = {
        "cats": [[[]...], [[]...]...],
        "dogs": [[[]...], [[]...]...]
    }

    have to be python lists, not numpy arrays

    normalization represents the 0-1 normalization scheme used. Defaults to simple linear
    """
    full_tensor = np.array(reduce(operator.add, master_dict.values()))
    centering = np.sum(full_tensor, axis=0)/len(full_tensor)
    return {key: seth_center(np.array(value), centering, normalization) for key,value in master_dict.items()}
    #self.normalization = normalization

def seth_center(list_of_arrays, centering_factor, normalization_scheme):
    """
    Centering scheme for arrays
    """
    def seth_normalize(a):
        a_min = np.min(a)
        return (a - a_min) / (np.max(a) - a_min)
    arrays = list_of_arrays - centering_factor
    return normalization_scheme([seth_normalize(array) for array in arrays])

#### Original code, by slater ####

def slater_init(master_dict, normalization = lambda x: math.exp(x)):
    """
    master_dict should be a dictionary mapping classes to lists of matrices

    example = {
        "cats": [[[]...], [[]...]...],
        "dogs": [[[]...], [[]...]...]
    }

    have to be python lists, not numpy arrays

    normalization represents the 0-1 normalization scheme used. Defaults to simple linear
    """
    normalization = np.vectorize(normalization)
    full_tensor = np.array(reduce(operator.add, master_dict.values()))
    centering = np.sum(np.array(reduce(operator.add, master_dict.values())), axis=0)/len(full_tensor)
    return {key: slater_center(np.array(value), centering, normalization) for key,value in master_dict.items()}
    #self.normalization = normalization

def slater_center(list_of_arrays, centering_factor, normalization_scheme):
    """
    Centering scheme for arrays
    """
    arrays = list_of_arrays - centering_factor
    slater_normalize = lambda a: (a - np.min(a)) / (np.max(a) - np.min(a))
    return normalization_scheme([slater_normalize(array) for array in arrays])
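To compare the outputs of the different versions, something along these lines could be used. This is a minimal sketch: `same_output` is a hypothetical helper, not part of the timing code above, and it only assumes that each version returns a dictionary mapping keys to array-like stacks of images.

```python
import numpy as np

def same_output(result_a, result_b):
    # Two results match if they have the same keys and, for every key,
    # numerically close values (np.allclose tolerates float rounding).
    if result_a.keys() != result_b.keys():
        return False
    return all(np.allclose(result_a[k], result_b[k]) for k in result_a)

# Tiny demonstration with made-up data.
a = {0: np.ones((2, 3, 3)), 1: np.zeros((1, 3, 3))}
b = {k: v.tolist() for k, v in a.items()}   # same values, stored as lists
c = {0: np.ones((2, 3, 3)), 1: np.ones((1, 3, 3))}

print(same_output(a, b), same_output(a, c))
```

With the functions from the timing code, the call would look like `same_output(my_init(example), seth_init(example))`.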

Answer 2

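In addition to the math.exp -> np.exp suggestion, which seems to have worked, I'd suggest a few other changes. First, you compute np.array(reduce(operator.add, master_dict.values())) twice, so in the rework below I reuse the result instead of doing the work twice. Second, I turned the normalize lambda into a proper function so that the array's minimum can be computed once up front. That saves computing it twice.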

def __init__(self, master_dict, normalization = np.exp):
    """
    master_dict should be a dictionary mapping classes to lists of matrices

    example = {
        "cats": [[[]...], [[]...]...],
        "dogs": [[[]...], [[]...]...]
    }

    have to be python lists, not numpy arrays

    normalization represents the 0-1 normalization scheme used. Defaults to simple linear
    """
    full_tensor = np.array(reduce(operator.add, master_dict.values()))
    centering = np.sum(full_tensor, axis=0)/len(full_tensor)
    self.data = {key: self._center(np.array(value), centering, normalization) for key,value in master_dict.items()}
    self.normalization = normalization

def _center(self, list_of_arrays, centering_factor, normalization_scheme):
    """
    Centering scheme for arrays
    """
    def normalize(a):
        a_min = np.min(a)
        return (a - a_min) / (np.max(a) - a_min)
    arrays = list_of_arrays - centering_factor
    return normalization_scheme([normalize(array) for array in arrays])
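The math.exp -> np.exp change matters because `np.vectorize` is a convenience wrapper that still calls the Python function once per element, whereas `np.exp` runs the loop in compiled code. A minimal sketch for checking this on your own machine; the array size and repeat count are arbitrary choices for illustration, and no particular speedup is claimed here.

```python
import math
import timeit

import numpy as np

slow_exp = np.vectorize(math.exp)   # calls math.exp once per element
x = np.linspace(0.0, 1.0, 48 * 48).reshape(48, 48)

# Both produce the same values.
same_values = np.allclose(slow_exp(x), np.exp(x))

# Time a few repetitions of each; the results depend on the machine.
t_vectorize = timeit.timeit(lambda: slow_exp(x), number=5)
t_numpy = timeit.timeit(lambda: np.exp(x), number=5)
print(same_values, t_vectorize, t_numpy)
```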

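Regarding your comment that you need to do Python-specific things and therefore can't convert to arrays before manipulating the data: nothing stops you from calling (for example) reduce on a numpy array. Numpy arrays are iterable, so anywhere you would use a list you can use a numpy array (OK, not anywhere, but in most cases). That said, I haven't fully familiarized myself with your algorithm, and perhaps this case is one of the exceptions.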
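A short sketch of what reduce does on a numpy array, written for Python 3 (where `reduce` is imported from `functools`). One caveat worth knowing: `operator.add` concatenates lists but adds arrays element-wise, so the result differs from the list case. The `stack` array is made up for illustration.

```python
import operator
from functools import reduce

import numpy as np

stack = np.arange(24, dtype=float).reshape(3, 2, 4)

# Iterating over an array yields its sub-arrays along the first axis,
# so reduce adds the three 2x4 slices together element by element.
total = reduce(operator.add, stack)
matches_sum = np.array_equal(total, stack.sum(axis=0))

# The same call on plain lists concatenates instead of summing.
as_lists = reduce(operator.add, stack.tolist())

print(total.shape, matches_sum, len(as_lists))
```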

Numpy normalization code is very slow
