Sum of squared differences between every pair of rows of a 17,000 × 300 matrix

Date: 2016-10-04 06:55:21

Tags: python numpy optimization matrix

OK, so I have a matrix with 17000 rows (samples) and 300 columns (features). I basically want to compute the (squared) Euclidean distance between every possible pair of rows, i.e. the sum of squared differences for each pair. Obviously that is a lot of work: IPython, while not completely crashing my laptop, just says "(busy)" for a while, and then I cannot run anything anymore; it seems to have given up, even though I can still move my mouse and everything.

Is there a way to make this work? Here is the function I wrote; I use numpy everywhere. What I am doing is storing the differences in a difference matrix, one entry for each possible pair. I know that the lower triangular part of the matrix equals the upper triangular part, but that would only save half the computation time (better than nothing, but not a game changer, I think).

Edit: I just tried scipy.spatial.distance.pdist, but it has been running for a while now with no end in sight. Is there a better way? I should also mention that I have NaN values in there... but that is apparently not a problem for numpy.

features = np.array(dataframe)
distances = np.zeros((17000, 17000))


def sum_diff():
    for i in range(17000):
        for j in range(17000):
            diff = np.array(features[i] - features[j])
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares
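
If the NaN values should simply be ignored in the sums (an assumption on my part, treating each NaN difference as contributing zero), a minimal sketch would swap np.sum for np.nansum:

def sum_diff_nan():
    # like sum_diff, but NaN differences are skipped by np.nansum
    for i in range(17000):
        for j in range(17000):
            distances[i][j] = np.nansum(np.square(features[i] - features[j]))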

3 Answers:

Answer 0 (score: 2):

You can always cut the computation time in half by noting that d(i, i) = 0 and d(i, j) = d(j, i).
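
As a minimal sketch of this idea, scipy's pdist computes only one triangle and squareform expands the condensed result into a full matrix; the 'sqeuclidean' metric matches the sum of squared differences asked for here:

from scipy.spatial.distance import pdist, squareform
import numpy as np

a = np.random.rand(5, 3)
# condensed vector holding only the upper-triangular distances
condensed = pdist(a, metric='sqeuclidean')
# full symmetric 5x5 matrix with a zero diagonal
full = squareform(condensed)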

But have you looked at sklearn.metrics.pairwise.pairwise_distances() (available from version 0.18, see the doc here)?

You would use it as:

from sklearn.metrics import pairwise
import numpy as np

a = np.array([[0, 0, 0], [1, 1, 1], [3, 3, 3]])
pairwise.pairwise_distances(a)
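
Note that pairwise_distances defaults to the plain Euclidean metric, i.e. with the square root taken. If you want the raw sums of squared differences instead, here is a sketch of two options (assuming the scipy-backed 'sqeuclidean' metric is acceptable here):

from sklearn.metrics import pairwise
import numpy as np

a = np.array([[0, 0, 0], [1, 1, 1], [3, 3, 3]])
# squared Euclidean distances via the scipy 'sqeuclidean' metric
pairwise.pairwise_distances(a, metric='sqeuclidean')
# or sklearn's dedicated helper with squared=True
pairwise.euclidean_distances(a, squared=True)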

Answer 1 (score: 1):

As always with numpy, the important thing is to avoid loops and let the work be done by vectorized operations, so there are a few basic improvements that will save some computation time:

import numpy as np
import timeit

# I reduced the problem size to 1000x300 to keep the timings in a reasonable range
n=1000
features = np.random.rand(n,300)
distances = np.zeros((n,n))


def sum_diff():
    for i in range(n):
        for j in range(n):
            diff = np.array(features[i] - features[j])
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares

# Here I removed the unnecessary copy induced by calling np.array
# -> some improvement
def sum_diff_v0():
    for i in range(n):
        for j in range(n):
            diff = features[i] - features[j]
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares

# Collapsing the statements into one line -> no improvement
def sum_diff_v1():
    for i in range(n):
        for j in range(n):
            distances[i][j] = np.sum(np.square(features[i] - features[j]))

# Using broadcasting and vectorized operations -> big improvement
def sum_diff_v2():
    for i in range(n):
        distances[i] = np.sum(np.square(features[i] - features),axis=1)

# Computing only half the distances -> 1/2 the computation time
def sum_diff_v3():
    for i in range(n):
        distances[i][i+1:] = np.sum(np.square(features[i] - features[i+1:]),axis=1)
    # the diagonal is zero, so adding the transpose fills in the lower triangle
    distances[:] = distances + distances.T

print("original :",timeit.timeit(sum_diff, number=10))
print("v0 :",timeit.timeit(sum_diff_v0, number=10))
print("v1 :",timeit.timeit(sum_diff_v1, number=10))
print("v2 :",timeit.timeit(sum_diff_v2, number=10))
print("v3 :",timeit.timeit(sum_diff_v3, number=10))

Edit: for completeness, I also timed Camilleri's much faster solution:

from sklearn.metrics import pairwise

def Camilleri_solution():
    # note: this returns plain Euclidean distances, not the squared sums above
    distances = pairwise.pairwise_distances(features)

Timing results (in seconds, each function run 10 times on a 1000x300 input):

original : 138.36921879299916
v0 : 111.39915344800102
v1 : 117.7582511530054
v2 : 23.702392491002684
v3 : 9.712442981006461
Camilleri's : 0.6131987979897531

So you can see that we can easily gain an order of magnitude by using the proper numpy syntax. Note that with only 1/20 of the data, the v3 function runs in about one second per call, so I would expect the whole computation to run within tens of minutes, since the script scales as N^2.
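
Going one step further, the remaining loop over i can also be removed with the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y. A minimal sketch, not included in the timings above:

def sum_diff_v4():
    # squared norm of every row, shape (n,)
    sq = np.sum(np.square(features), axis=1)
    # all pairwise squared distances at once via ||x||^2 + ||y||^2 - 2 x.y
    # (floating point rounding can leave tiny negative values near zero; clip if needed)
    distances[:] = sq[:, None] + sq[None, :] - 2.0 * np.dot(features, features.T)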

Answer 2 (score: -3):

Forget numpy, which is just a convenience solution for self-extending arrays. Use plain python lists instead, which have very fast indexed access, about 15 times faster. Use it like this:

features = dataframe.values.tolist()
# build independent rows; [[None]*17000]*17000 would alias one row 17000 times
distances = [[None] * 17000 for _ in range(17000)]

def sum_diff():
    for i in range(17000):
        for j in range(17000):
            sumsquares = 0  # reset the accumulator for every pair
            for k in range(300):
                diff = features[i][k] - features[j][k]
                sumsquares = sumsquares + diff * diff
            distances[i][j] = sumsquares

I expect this to be faster than your solution; just try it out and give feedback.