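I'm running into some (mysterious) performance problems with NumPy matrix-vector multiplication.
I wrote the following snippet to test the speed of matrix-vector multiplication: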
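import timeit

for i in range(90, 101):
    # repeat=5 is the default on Python 3.7; it is spelled out so the division below stays correct
    tm = timeit.repeat('np.matmul(a, b)', number = 10000, repeat = 5,
        setup = 'import numpy as np; a, b = np.random.rand({0},{0}), np.random.rand({0})'.format(i))
    print(i, sum(tm) / 5)  # mean time of the five runs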
On some machines the results are normal:
90 0.08936462279998522
91 0.08872119059979014
92 0.09083068459967762
93 0.09311594780047017
94 0.09907015420012613
95 0.10136517100036144
96 0.10339414420013782
97 0.10627872140012187
98 0.1102267580001353
99 0.11277738099979615
100 0.11471197419996315
On some machines the multiplication slows down starting at size 96:
90 0.03618830284103751
91 0.03737151022069156
92 0.03295294055715203
93 0.02851409767754376
94 0.02677299762144685
95 0.028137388220056892
96 0.1916038074065
97 0.16719966367818415
98 0.18511182265356182
99 0.1806833743583411
100 0.17172936061397195
On some it even slows down by a factor of about 1000:
90 0.04183819475583732
91 0.029678784403949977
92 0.02486871089786291
93 0.02882006801664829
94 0.028613184532150625
95 0.02956576123833656
96 31.16711748293601
97 27.803299666382372
98 31.368976181373
99 27.71114011341706
100 26.219610543036833
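All the machines I tested run the same Python / NumPy versions (3.7.2 / 1.16.2) and the same operating system (Arch Linux).
What could be causing this? And why does it happen at size 96?
Answer 0 (score: 1)
At size 96 your test runs into a software/hardware issue: 96 * 96 * 96 = 884,736, which is close to 1M; multiplied by 8 bytes per floating-point number, that gives 7,077,888 bytes. The Intel i5 processor has a 6 MB L3 cache. My iMac has this type of processor and shows this slowdown at size 96. The Intel® Core™ i5-7200U processor has a 3 MB L3 cache and does not have the problem. So the software algorithm may not be handling a 6 MB cache size correctly.
Answer 1 (score: 1)
I think I finally have the right answer and an explanation of why:
I wrote a longer program to test my Windows and macOS computers. It looks like NumPy under Python 3.7 starts running the matmul function on all four logical processors of my computer. I don't see this under 3.8.0a2: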
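The arithmetic in this answer can be checked with a short sketch. For comparison it also prints how many bytes the benchmark's own arrays occupy, which may help when judging whether a cache-size explanation fits. The names N, cubed_bytes and operand_bytes are illustrative and not part of the original post.

```python
import numpy as np

N = 96
BYTES_PER_FLOAT64 = 8

# The figure used in the answer: N**3 elements of 8 bytes each.
cubed_bytes = N ** 3 * BYTES_PER_FLOAT64

# What the benchmark allocates: an N x N matrix and a length-N vector of float64.
a, b = np.random.rand(N, N), np.random.rand(N)
operand_bytes = a.nbytes + b.nbytes

print("N**3 * 8 bytes        :", cubed_bytes)
print("matrix + vector bytes :", operand_bytes)
```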
$ python3.8 numpy_matmul.py              $ python3.7 numpy_matmul.py
Python version : 3.8.0a2                 Python version : 3.7.2
build: ('v3.8.0a2:23f4589b4b',           build: ('v3.7.2:9a3ffc0492',
  'Feb 25 2019 10:59:08')                  'Dec 24 2018 02:44:43')
compiler:                                compiler:
  Clang 6.0 (clang-600.0.57)               Clang 6.0 (clang-600.0.57)

Tested by Python code only :             Tested by Python code only :
 90 time = 0.1132 cpu = 0.1100            90 time = 0.1535 cpu = 0.1236
 91 time = 0.1133 cpu = 0.1130            91 time = 0.1264 cpu = 0.1263
 92 time = 0.1079 cpu = 0.1077            92 time = 0.1089 cpu = 0.1087
 93 time = 0.1146 cpu = 0.1145            93 time = 0.1226 cpu = 0.1224
 94 time = 0.1176 cpu = 0.1174            94 time = 0.1273 cpu = 0.1271
 95 time = 0.1216 cpu = 0.1215            95 time = 0.1372 cpu = 0.1371
 96 time = 0.1115 cpu = 0.1114            96 time = 0.2854 cpu = 0.8933
 97 time = 0.1231 cpu = 0.1229            97 time = 0.2887 cpu = 0.9033
 98 time = 0.1174 cpu = 0.1173            98 time = 0.2836 cpu = 0.8963
 99 time = 0.1330 cpu = 0.1301            99 time = 0.3100 cpu = 0.9108
100 time = 0.1130 cpu = 0.1128           100 time = 0.3149 cpu = 0.9087

Tested with timeit.repeat :              Tested with timeit.repeat :
 90 time = 0.1060 cpu = 0.1066            90 time = 0.1238 cpu = 0.3264
 91 time = 0.1091 cpu = 0.1097            91 time = 0.1233 cpu = 0.1240
 92 time = 0.1021 cpu = 0.1027            92 time = 0.1138 cpu = 0.1128
 93 time = 0.1149 cpu = 0.1156            93 time = 0.1324 cpu = 0.1327
 94 time = 0.1135 cpu = 0.1139            94 time = 0.1319 cpu = 0.1326
 95 time = 0.1170 cpu = 0.1177            95 time = 0.1325 cpu = 0.1331
 96 time = 0.1069 cpu = 0.1076            96 time = 0.2879 cpu = 0.8886
 97 time = 0.1192 cpu = 0.1198            97 time = 0.2867 cpu = 0.8986
 98 time = 0.1151 cpu = 0.1155            98 time = 0.3034 cpu = 0.8854
 99 time = 0.1200 cpu = 0.1207            99 time = 0.2867 cpu = 0.8966
100 time = 0.1146 cpu = 0.1153           100 time = 0.2901 cpu = 0.9018
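One way to read these numbers is the ratio of CPU time to wall time: a ratio near 1 is what a single busy thread would produce, while a ratio well above 1 suggests several threads were working at once. The sketch below only redoes that division on the size-96 rows of the "Python code only" run; the dictionary and variable names are illustrative.

```python
# (wall, cpu) seconds copied from the size-96 "Python code only" rows above.
reported = {
    "3.8.0a2": (0.1115, 0.1114),
    "3.7.2": (0.2854, 0.8933),
}

# CPU seconds consumed per wall-clock second, per interpreter version.
ratios = {version: cpu / wall for version, (wall, cpu) in reported.items()}

for version, ratio in ratios.items():
    print("Python %-8s cpu/wall = %.2f" % (version, ratio))
```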
Here is numpy_matmul.py:
import time
import timeit
import numpy as np
import platform

def correct_cpu(cpu_time):
    # Halve the CPU time on Python 3.5 to 3.8 when built with this Clang version.
    pv1, pv2, _ = platform.python_version_tuple()
    pcv = platform.python_compiler()
    if pv1 == '3' and '5' <= pv2 <= '8' and pcv == 'Clang 6.0 (clang-600.0.57)':
        cpu_time /= 2.0
    return cpu_time

def test(func, n, name):
    print('\nTested %s :' % name)
    for i in range(90, 101):
        t = time.perf_counter()   # wall-clock time
        c = time.process_time()   # CPU time of the whole process (all threads)
        tm = func(i, n)
        t = time.perf_counter() - t
        c = correct_cpu(time.process_time() - c)
        # Prefer the time reported by timeit, if func returned one.
        st = t if tm <= 0.0 else tm
        print('%3d time = %.4f cpu = %.4f' % (i, st, c))
        # Flag runs where the two wall-time figures differ by more than 2%.
        if abs(t-st)/st > 0.02:
            print(' time!= %.4f' % t)

def test1(i, n):
    # Plain Python loop; returns 0.0 so that test() uses its own timing.
    a, b = np.random.rand(i, i), np.random.rand(i)
    for _ in range(n):
        np.matmul(a, b)
    return 0.0

def test2(i, n):
    # timeit.repeat; returns the total over all repeats.
    s = 'import numpy as np;' + \
        'a, b = np.random.rand({0},{0}), np.random.rand({0})'
    s = s.format(i)
    r = 'np.matmul(a, b)'
    t = timeit.repeat(stmt=r, setup=s, number=n)
    return sum(t)

def test3(i, n):
    # A single timeit.timeit run.
    s = 'import numpy as np;' + \
        'a, b = np.random.rand({0},{0}), np.random.rand({0})'
    s = s.format(i)
    r = 'np.matmul(a, b)'
    return timeit.timeit(stmt=r, setup=s, number=n)

print('Python version :', platform.python_version())
print(' build :', platform.python_build())
print(' compiler :', platform.python_compiler())
num = 10000
test(test1, 5 * num, 'by Python code only')
test(test2, num, 'with timeit.repeat')
test(test3, 5 * num, 'with timeit.timeit')
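If multithreading inside matmul is the cause, limiting the linear-algebra backend to one thread should remove the jump at size 96. The following is a minimal sketch of that experiment, not part of the original script. It assumes the NumPy build honors at least one of the listed thread-count environment variables, which depends on the BLAS library it was linked against. The names CHILD, THREAD_VARS and run_matmul are hypothetical. The variables must be set before NumPy is imported, so each measurement runs in a fresh interpreter.

```python
import os
import subprocess
import sys

# Child script: times np.matmul for one size and prints wall and CPU seconds.
CHILD = """
import sys, time
import numpy as np
n, reps = int(sys.argv[1]), int(sys.argv[2])
a, b = np.random.rand(n, n), np.random.rand(n)
t, c = time.perf_counter(), time.process_time()
for _ in range(reps):
    np.matmul(a, b)
print(time.perf_counter() - t, time.process_time() - c)
"""

# Thread-count variables read by common BLAS/OpenMP backends
# (assumption: at least one of them applies to the installed NumPy).
THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS",
               "MKL_NUM_THREADS", "VECLIB_MAXIMUM_THREADS")

def run_matmul(n, reps, threads=None):
    """Return (wall, cpu) seconds measured in a fresh interpreter."""
    env = dict(os.environ)
    if threads is not None:
        for var in THREAD_VARS:
            env[var] = str(threads)
    out = subprocess.run([sys.executable, "-c", CHILD, str(n), str(reps)],
                         env=env, check=True, capture_output=True, text=True)
    wall, cpu = map(float, out.stdout.split())
    return wall, cpu

if __name__ == "__main__":
    for n in (95, 96):
        for threads in (None, 1):
            wall, cpu = run_matmul(n, 10000, threads)
            print("%3d threads = %-4s wall = %.4f cpu = %.4f"
                  % (n, threads, wall, cpu))
```

If the size-96 timings with threads = 1 fall back in line with size 95, that would support the threading explanation; if they do not, the backend may be ignoring these variables or the cause may lie elsewhere.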