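For a large number of matrices, I need to compute a distance measure defined as:
D(A, B) = 1/2 * trace(A^-1 B + B^-1 A)
Although I do know that matrix inversion is strongly discouraged, I don't see a way around it. I therefore tried to improve performance by hard-coding the matrix inversion, since all matrices have shape (3, 3).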
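I expected at least a tiny improvement, but there was none.
Why is numpy.linalg.inv faster / more efficient than this hard-coded matrix inversion?
Also, what other options do I have to improve this bottleneck?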
import numpy as np

def inversion(m):
    # unpack the nine entries of the 3x3 matrix in row-major order
    m1, m2, m3, m4, m5, m6, m7, m8, m9 = m.flatten()
    determinant = m1*m5*m9 + m4*m8*m3 + m7*m2*m6 - m1*m6*m8 - m3*m5*m7 - m2*m4*m9
    # adjugate matrix divided by the determinant
    return np.array([[m5*m9-m6*m8, m3*m8-m2*m9, m2*m6-m3*m5],
                     [m6*m7-m4*m9, m1*m9-m3*m7, m3*m4-m1*m6],
                     [m4*m8-m5*m7, m2*m7-m1*m8, m1*m5-m2*m4]])/determinant
Timing comparison with a random 3x3 matrix:
%timeit np.linalg.inv(a)
100000 loops, best of 3: 12.5 µs per loop
%timeit inversion(a)
100000 loops, best of 3: 13.9 µs per loop
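Closely related, but not quite a duplicate, is a post on Code Review that explains the background and the whole function.
Edit: As @Divakar suggested in the comments, using m.ravel() instead of m.flatten() slightly improves the hard-coded inversion, so the timing comparison now yields:
numpy - 100000 loops, best of 3: 12.6 µs per loop
hard-coded - 100000 loops, best of 3: 12.8 µs per loop
The gap is narrowing, but the hard-coded version is still slower. How come?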
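The difference that @Divakar points to comes down to copying: `flatten()` always returns a copy, whereas `ravel()` returns a view whenever the memory layout allows it. A minimal sketch of this, where the array `a` is an arbitrary stand-in and not the matrix used in the timings:

```python
import numpy as np

a = np.arange(9.0).reshape(3, 3)  # arbitrary C-contiguous 3x3 example

flat_copy = a.flatten()  # always allocates a new array
flat_view = a.ravel()    # no copy needed for a contiguous array

copy_shares = np.shares_memory(a, flat_copy)
view_shares = np.shares_memory(a, flat_view)
print(copy_shares, view_shares)
```

For a 3x3 matrix the copy is tiny, which would be consistent with the small improvement reported above.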
Answer 0 (score: 2)
Here is a simple optimization that saves 9 multiplications and 3 subtractions:

def inversion(m):
    m1, m2, m3, m4, m5, m6, m7, m8, m9 = m.ravel()
    inv = np.array([[m5*m9-m6*m8, m3*m8-m2*m9, m2*m6-m3*m5],
                    [m6*m7-m4*m9, m1*m9-m3*m7, m3*m4-m1*m6],
                    [m4*m8-m5*m7, m2*m7-m1*m8, m1*m5-m2*m4]])
    # first row of the adjugate dotted with the first column of m is the determinant
    return inv / np.dot(inv[0], m[:, 0])

You can squeeze out a few more operations (another 24 multiplications, if I am counting correctly) by computing the entire trace in one go:

def det(m):
    m1, m2, m3, m4, m5, m6, m7, m8, m9 = m.ravel()
    return np.dot(m[:, 0], [m5*m9-m6*m8, m3*m8-m2*m9, m2*m6-m3*m5])
    # or try m1*(m5*m9-m6*m8) + m4*(m3*m8-m2*m9) + m7*(m2*m6-m3*m5)

# probably the fastest would be to inline the two calls to det
# I'm not doing it here because of readability but you should try it
def dist(m, n):
    m1, m2, m3, m4, m5, m6, m7, m8, m9 = m.ravel()
    n1, n2, n3, n4, n5, n6, n7, n8, n9 = n.ravel()
    return 0.5 * np.dot(
        m.ravel()/det(m) + n.ravel()/det(n),
        [m5*n9-m6*n8, m6*n7-m4*n9, m4*n8-m5*n7, n3*m8-n2*m9, n1*m9-n3*m7,
         n2*m7-n1*m8, m2*n6-m3*n5, m3*n4-m1*n6, m1*n5-m2*n4])

OK, here is the inlined version:

import numpy as np
from timeit import timeit

def dist(m, n):
    m1, m2, m3, m4, m5, m6, m7, m8, m9 = m.ravel()
    n1, n2, n3, n4, n5, n6, n7, n8, n9 = n.ravel()
    return 0.5 * np.dot(
        m.ravel()/(m1*(m5*m9-m6*m8) + m4*(m3*m8-m2*m9) + m7*(m2*m6-m3*m5))
        + n.ravel()/(n1*(n5*n9-n6*n8) + n4*(n3*n8-n2*n9) + n7*(n2*n6-n3*n5)),
        [m5*n9-m6*n8, m6*n7-m4*n9, m4*n8-m5*n7, n3*m8-n2*m9, n1*m9-n3*m7,
         n2*m7-n1*m8, m2*n6-m3*n5, m3*n4-m1*n6, m1*n5-m2*n4])

def dist_np(m, n):
    return 0.5 * np.diag(np.linalg.inv(m)@n + np.linalg.inv(n)@m).sum()

for i in range(3):
    A, B = np.random.random((2,3,3))
    print(dist(A, B), dist_np(A, B))
    print('pp ', timeit('f(A,B)', number=10000, globals={'f':dist, 'A':A, 'B':B}))
    print('numpy ', timeit('f(A,B)', number=10000, globals={'f':dist_np, 'A':A, 'B':B}))

Prints:

2.20109953156 2.20109953156
pp 0.13215381593909115
numpy 0.4334693900309503
7.50799877993 7.50799877993
pp 0.13934064202476293
numpy 0.32861811900511384
-0.780284449609 -0.780284449609
pp 0.1258618349675089
numpy 0.3110764700686559

Note that you can make another substantial saving by batching, using a vectorized version of the function. The following test computes all 10,000 pairwise distances between two batches of 100 matrices:

def dist(m, n):
    # flatten each 3x3 matrix to 9 entries and move that axis to the front
    m = np.moveaxis(np.reshape(m, m.shape[:-2] + (-1,)), -1, 0)
    n = np.moveaxis(np.reshape(n, n.shape[:-2] + (-1,)), -1, 0)
    m1, m2, m3, m4, m5, m6, m7, m8, m9 = m
    n1, n2, n3, n4, n5, n6, n7, n8, n9 = n
    return 0.5 * np.einsum("i...,i...->...",
        m/(m1*(m5*m9-m6*m8) + m4*(m3*m8-m2*m9) + m7*(m2*m6-m3*m5))
        + n/(n1*(n5*n9-n6*n8) + n4*(n3*n8-n2*n9) + n7*(n2*n6-n3*n5)),
        [m5*n9-m6*n8, m6*n7-m4*n9, m4*n8-m5*n7, n3*m8-n2*m9, n1*m9-n3*m7,
         n2*m7-n1*m8, m2*n6-m3*n5, m3*n4-m1*n6, m1*n5-m2*n4])

def dist_np(m, n):
    return 0.5 * (np.linalg.inv(m)@n + np.linalg.inv(n)@m)[..., np.arange(3), np.arange(3)].sum(axis=-1)

for i in range(3):
    A = np.random.random((100,1,3,3))
    B = np.random.random((1,100,3,3))
    print(np.allclose(dist(A, B), dist_np(A, B)))
    print('pp ', timeit('f(A,B)', number=100, globals={'f':dist, 'A':A, 'B':B}))
    print('numpy ', timeit('f(A,B)', number=100, globals={'f':dist_np, 'A':A, 'B':B}))
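Since the question notes that explicit inversion is discouraged, the same distance can also be written with `np.linalg.solve`, because `inv(m) @ n` equals `solve(m, n)`. The sketch below is not part of the answer above and has not been timed; `dist_solve` and `dist_inv` are hypothetical names, and both inputs are assumed to be stacks of the same shape `(..., 3, 3)`.

```python
import numpy as np

def dist_solve(m, n):
    # solve(m, n) computes inv(m) @ n without forming inv(m) explicitly
    x = np.linalg.solve(m, n) + np.linalg.solve(n, m)
    # trace over the last two axes, so stacks of matrices work too
    return 0.5 * np.trace(x, axis1=-2, axis2=-1)

def dist_inv(m, n):
    # reference: the same quantity via explicit inverses
    x = np.linalg.inv(m) @ n + np.linalg.inv(n) @ m
    return 0.5 * np.trace(x, axis1=-2, axis2=-1)

rng = np.random.default_rng(0)
# adding 3*I makes the random test matrices diagonally dominant, hence invertible
A = rng.random((100, 3, 3)) + 3 * np.eye(3)
B = rng.random((100, 3, 3)) + 3 * np.eye(3)
print(np.allclose(dist_solve(A, B), dist_inv(A, B)))
```

For 3x3 matrices the per-call overhead is likely to dominate either way, so this is mainly a numerical-robustness alternative rather than a guaranteed speedup.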
Answer 1 (score: 2)
I suppose that creating four Python objects (four lists) when you call np.array() adds a small overhead.
I created the following file (test.py):
import numpy as np

def one():
    return np.array([[0.1,0.2,0.3],[0.4,0.5,0.6],[0.7,0.8,0.9]])

def two():
    a = np.zeros((3, 3))
    a[0,0]=0.1
    a[0,1]=0.2
    a[0,2]=0.3
    a[1,0]=0.4
    a[1,1]=0.5
    a[1,2]=0.6
    a[2,0]=0.7
    a[2,1]=0.8
    a[2,2]=0.9
    return a
one() and two() both do the same thing. However, one() creates four Python lists in the process, whereas two() does not. Now:
$ python -m timeit -s 'from test import one' 'one()'
100000 loops, best of 3: 3.13 usec per loop
$ python -m timeit -s 'from test import one' 'one()'
100000 loops, best of 3: 2.95 usec per loop
$ python -m timeit -s 'from test import one' 'one()'
100000 loops, best of 3: 3 usec per loop
$ python -m timeit -s 'from test import two' 'two()'
1000000 loops, best of 3: 1.61 usec per loop
$ python -m timeit -s 'from test import two' 'two()'
1000000 loops, best of 3: 1.76 usec per loop
$ python -m timeit -s 'from test import two' 'two()'
1000000 loops, best of 3: 1.69 usec per loop
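I also tried using tuples instead of lists, and the result is as expected: slower than creating no new Python objects, but faster than lists, since tuples are immutable and probably carry less overhead.
def three():
    return np.array(((0.1, 0.2, 0.3),(0.4,0.5,0.6),(0.7,0.8,0.9)))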
$ python -m timeit -s 'from test import three' 'three()'
100000 loops, best of 3: 2.11 usec per loop
$ python -m timeit -s 'from test import three' 'three()'
100000 loops, best of 3: 2.03 usec per loop
$ python -m timeit -s 'from test import three' 'three()'
100000 loops, best of 3: 2.08 usec per loop
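Carrying this idea over to the question's function, one could fill a preallocated array instead of building nested lists. This is only a sketch: `inversion_prealloc` is a hypothetical name, and whether it actually beats `np.linalg.inv` has not been measured here.

```python
import numpy as np

def inversion_prealloc(m):
    m1, m2, m3, m4, m5, m6, m7, m8, m9 = m.ravel()
    inv = np.empty((3, 3))
    # write the adjugate entries directly, avoiding the nested Python lists
    inv[0, 0] = m5*m9 - m6*m8
    inv[0, 1] = m3*m8 - m2*m9
    inv[0, 2] = m2*m6 - m3*m5
    inv[1, 0] = m6*m7 - m4*m9
    inv[1, 1] = m1*m9 - m3*m7
    inv[1, 2] = m3*m4 - m1*m6
    inv[2, 0] = m4*m8 - m5*m7
    inv[2, 1] = m2*m7 - m1*m8
    inv[2, 2] = m1*m5 - m2*m4
    # first adjugate row dotted with the first column of m is the determinant
    det = inv[0, 0]*m1 + inv[0, 1]*m4 + inv[0, 2]*m7
    inv /= det
    return inv

# arbitrary invertible example matrix (not from the original post)
a = np.array([[2.0, 0.0, 1.0], [1.0, 3.0, 0.0], [0.0, 1.0, 4.0]])
print(np.allclose(inversion_prealloc(a), np.linalg.inv(a)))
```

The nine item assignments each go through the interpreter as well, so any gain should be checked with timeit on the target machine.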