Question

I'm running into some (mysterious) performance problems with NumPy matrix-vector multiplication.

I wrote the following snippet to test the speed of matrix-vector multiplication:

import timeit
for i in range(90, 101):
    tm = timeit.repeat('np.matmul(a, b)', number=10000,
        setup='import numpy as np; a, b = np.random.rand({0},{0}), np.random.rand({0})'.format(i))
    print(i, sum(tm) / len(tm))  # mean over the repeats (5 by default in Python 3.7)

On some machines the results are normal:

90 0.08936462279998522
91 0.08872119059979014
92 0.09083068459967762
93 0.09311594780047017
94 0.09907015420012613
95 0.10136517100036144
96 0.10339414420013782
97 0.10627872140012187
98 0.1102267580001353
99 0.11277738099979615
100 0.11471197419996315

On some machines the multiplication slows down starting at size 96:

90 0.03618830284103751
91 0.03737151022069156
92 0.03295294055715203
93 0.02851409767754376
94 0.02677299762144685
95 0.028137388220056892
96 0.1916038074065
97 0.16719966367818415
98 0.18511182265356182
99 0.1806833743583411
100 0.17172936061397195

On some it even slows down by a factor of about 1000:

90 0.04183819475583732
91 0.029678784403949977
92 0.02486871089786291
93 0.02882006801664829
94 0.028613184532150625
95 0.02956576123833656
96 31.16711748293601
97 27.803299666382372
98 31.368976181373
99 27.71114011341706
100 26.219610543036833

The Python / NumPy versions are the same on all the machines I tested (3.7.2 / 1.16.2). The OS is the same as well (Arch Linux).

What could be the cause? And why does it happen at size 96?

Answer 1

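At size 96 your test runs into some software/hardware issue: 96 * 96 * 96 = 884,736. That is close to 1M, and multiplying by 8 bytes per float gives 7,077,888. Intel i5 processors have 6 MB of L3 cache. My iMac has this type of processor and shows the slowdown at size 96. The Intel® Core™ i5-7200U processor has 3 MB of L3 cache and does not show the problem. So it may be that the software algorithm does not handle a 6 MB cache size correctly.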
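The arithmetic above can be checked with a short sketch. For comparison it also computes the number of bytes that a single 96 x 96 matrix-vector product touches, assuming 8-byte float64 values as produced by np.random.rand. The variable names are illustrative only.

```python
# Reproduce the figures quoted in the answer (8 bytes per float64).
n = 96
bytes_per_float = 8

cubed = n * n * n                       # the n**3 figure from the answer
cubed_bytes = cubed * bytes_per_float   # the same figure in bytes

# For comparison: data touched by one matrix-vector product
# (an n x n matrix, an input vector and an output vector).
matvec_bytes = (n * n + 2 * n) * bytes_per_float

print(cubed, cubed_bytes, matvec_bytes)
```

The first two values match the 884,736 and 7,077,888 quoted above; the third comes to 75,264 bytes.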

Answer 2

I think I finally have the right answer and an explanation of why:

This problem is fixed in Python version 3.8.0a2 (the current pre-release test version).
The problem exists in Python v3.7.2 (the latest release) on Windows and macOS.

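I wrote a longer program to test my Windows and macOS machines. It looks like NumPy under 3.7 starts running the matmul function on all four logical processors of my computer. I don't see this under 3.8.0a2: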

$ python3.8 numpy_matmul.py       $ python3.7 numpy_matmul.py     

Python version  : 3.8.0a2         Python version  : 3.7.2         
  build:('v3.8.0a2:23f4589b4b',    build:('v3.7.2:9a3ffc0492',
        Feb 25 2019 10:59:08')          'Dec 24 2018 02:44:43')
  compiler:                        compiler:
     Clang 6.0 (clang-600.0.57)   Clang 6.0 (clang-600.0.57) 

Tested by Python code only :      Tested by Python code only :  
 90 time = 0.1132 cpu = 0.1100     90 time = 0.1535 cpu = 0.1236
 91 time = 0.1133 cpu = 0.1130     91 time = 0.1264 cpu = 0.1263
 92 time = 0.1079 cpu = 0.1077     92 time = 0.1089 cpu = 0.1087
 93 time = 0.1146 cpu = 0.1145     93 time = 0.1226 cpu = 0.1224
 94 time = 0.1176 cpu = 0.1174     94 time = 0.1273 cpu = 0.1271
 95 time = 0.1216 cpu = 0.1215     95 time = 0.1372 cpu = 0.1371
 96 time = 0.1115 cpu = 0.1114     96 time = 0.2854 cpu = 0.8933
 97 time = 0.1231 cpu = 0.1229     97 time = 0.2887 cpu = 0.9033
 98 time = 0.1174 cpu = 0.1173     98 time = 0.2836 cpu = 0.8963
 99 time = 0.1330 cpu = 0.1301     99 time = 0.3100 cpu = 0.9108
100 time = 0.1130 cpu = 0.1128    100 time = 0.3149 cpu = 0.9087

Tested with timeit.repeat :       Tested with timeit.repeat :   
 90 time = 0.1060 cpu = 0.1066     90 time = 0.1238 cpu = 0.3264
 91 time = 0.1091 cpu = 0.1097     91 time = 0.1233 cpu = 0.1240
 92 time = 0.1021 cpu = 0.1027     92 time = 0.1138 cpu = 0.1128
 93 time = 0.1149 cpu = 0.1156     93 time = 0.1324 cpu = 0.1327
 94 time = 0.1135 cpu = 0.1139     94 time = 0.1319 cpu = 0.1326
 95 time = 0.1170 cpu = 0.1177     95 time = 0.1325 cpu = 0.1331
 96 time = 0.1069 cpu = 0.1076     96 time = 0.2879 cpu = 0.8886
 97 time = 0.1192 cpu = 0.1198     97 time = 0.2867 cpu = 0.8986
 98 time = 0.1151 cpu = 0.1155     98 time = 0.3034 cpu = 0.8854
 99 time = 0.1200 cpu = 0.1207     99 time = 0.2867 cpu = 0.8966
100 time = 0.1146 cpu = 0.1153    100 time = 0.2901 cpu = 0.9018
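In the 3.7 column the reported cpu value is roughly three times the wall-clock time for sizes 96 and above (for example 0.8933 / 0.2854 ≈ 3.1), which is what one would expect if several threads were busy at once. A minimal sketch of that comparison, using only the standard library; cpu_wall_ratio and busy are hypothetical helper names and are not part of numpy_matmul.py:

```python
import time


def cpu_wall_ratio(func, *args):
    """Return (wall, cpu, cpu / wall) for one call of func(*args)."""
    wall0 = time.perf_counter()
    cpu0 = time.process_time()   # CPU time of the whole process
    func(*args)
    cpu = time.process_time() - cpu0
    wall = time.perf_counter() - wall0
    return wall, cpu, cpu / wall


def busy(n):
    # Single-threaded pure-Python work, used only as a stand-in workload.
    total = 0
    for k in range(n):
        total += k * k
    return total


wall, cpu, ratio = cpu_wall_ratio(busy, 1_000_000)
print('time = %.4f cpu = %.4f ratio = %.2f' % (wall, cpu, ratio))
```

For single-threaded work the ratio should stay near 1; a ratio well above 1 would suggest that the measured call is using more than one core.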

Here is numpy_matmul.py:

import time
import timeit
import numpy as np
import platform


def correct_cpu(cpu_time):
    pv1, pv2, _ = platform.python_version_tuple()
    pcv = platform.python_compiler()
    if pv1 == '3' and '5' <= pv2 <= '8' and pcv =='Clang 6.0 (clang-600.0.57)':
        cpu_time /= 2.0
    return cpu_time


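def test(func, n, name):
    """Run func for sizes 90..100 and print wall-clock and CPU time for each."""
    print('\nTested %s :' % name)
    for i in range(90, 101):
        t = time.perf_counter()   # wall-clock start
        c = time.process_time()   # CPU-time start (whole process)
        tm = func(i, n)
        t = time.perf_counter() - t
        c = correct_cpu(time.process_time() - c)
        # Use the time reported by func if it returns one, else our own measurement.
        st = t if tm <= 0.0 else tm
        print('%3d time = %.4f cpu = %.4f' % (i, st, c))
        # Flag runs where the two wall-clock measurements differ by more than 2%.
        if abs(t-st)/st > 0.02:
            print('    time!= %.4f' % t)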


def test1(i, n):
    a, b = np.random.rand(i, i), np.random.rand(i)
    for _ in range(n):
        np.matmul(a, b)
    return 0.0


def test2(i, n):
    s = 'import numpy as np;' + \
        'a, b = np.random.rand({0},{0}), np.random.rand({0})'
    s = s.format(i)
    r = 'np.matmul(a, b)'
    t = timeit.repeat(stmt=r, setup=s, number=n)
    return sum(t)


def test3(i, n):
    s = 'import numpy as np;' + \
        'a, b = np.random.rand({0},{0}), np.random.rand({0})'
    s = s.format(i)
    r = 'np.matmul(a, b)'
    return timeit.timeit(stmt=r, setup=s, number=n)


print('Python version  :', platform.python_version())
print('       build    :', platform.python_build())
print('       compiler :', platform.python_compiler())
num = 10000
test(test1, 5 * num, 'by Python code only')
test(test2, num, 'with timeit.repeat')
test(test3, 5 * num, 'with timeit.timeit')
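One way to probe the multi-threading hypothesis, assuming the installed NumPy is linked against a threaded BLAS such as OpenBLAS, MKL or Apple's Accelerate, is to rerun the script with the thread count capped at one. The sketch below is not from the original answer: the helper names are hypothetical, and whether each variable has any effect depends on how NumPy was built.

```python
import os
import subprocess
import sys

# Environment variables commonly read by threaded BLAS/OpenMP runtimes.
# They need to be set before NumPy is imported, hence the child process.
THREAD_VARS = (
    'OMP_NUM_THREADS',
    'OPENBLAS_NUM_THREADS',
    'MKL_NUM_THREADS',
    'VECLIB_MAXIMUM_THREADS',
)


def single_thread_env(base=None):
    """Return a copy of the environment with the thread variables set to 1."""
    env = dict(os.environ if base is None else base)
    for name in THREAD_VARS:
        env[name] = '1'
    return env


def run_single_threaded(script):
    """Run script in a child interpreter using the capped environment."""
    return subprocess.run([sys.executable, script], env=single_thread_env())
```

Calling run_single_threaded('numpy_matmul.py') and comparing its output with an unrestricted run may show whether the jump at size 96 depends on the number of threads.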

Performance degradation of NumPy matrix-vector multiplication
