Fastest way to compute many 3x3 matrix-matrix multiplications

Time: 2020-10-28 17:37:09

Tags: numpy gpu matrix-multiplication cupy

I need to compute the composition (ordered product) of many 3x3 rotation matrices.

Here is a comparison of functools.reduce with matmul, run first on numpy arrays and then on cupy arrays:

import timeit
from functools import reduce
import numpy as np
import cupy as cp
from pyrr.matrix33 import create_from_axis_rotation

# generate random rotation matrices
axes = np.random.rand(10000, 3)
angles = np.pi * np.random.rand(10000)
rotations = [create_from_axis_rotation(*params) for params in zip(axes, angles)]

# then reduce with matmul

xp = np # numpy
xp_rotations = [xp.asarray(rotation) for rotation in rotations]
timexp = timeit.timeit("reduce(xp.matmul, xp_rotations)", number=10, globals=globals())
print(f"{xp.__name__}: {timexp * 1000:0.3}ms")

xp = cp # cupy
xp_rotations = [xp.asarray(rotation) for rotation in rotations]
timexp = timeit.timeit("reduce(xp.matmul, xp_rotations)", number=10, globals=globals())
print(f"{xp.__name__}: {timexp * 1000:0.3}ms")

On a decent machine with a Titan GPU, this gives:

numpy: 1.63e+02ms
cupy: 8.78e+02ms

For some reason, the GPU version is much slower.

In any case, is there a way to compute this product significantly faster?

Edit

I found a fairly simple solution that works for any chain of small linear transforms (and can easily be extended to affine transforms).


def reduce_loop(matrices):
    """ non-optimized reduce """
    mat = matrices[0]
    for _mat in matrices[1:]:
        mat = mat @ _mat
    return mat

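def reduce_split(matrices):
    """ reduce by multiplying pairs of matrices recursively """
    # the strided slices and batched @ below need one stacked array
    # of shape (n, 3, 3); a plain Python list does not support them
    matrices = np.asarray(matrices)
    if len(matrices) == 1:
        return matrices[0]
    neven = (len(matrices) // 2) * 2
    # one batched matmul computes (M0 @ M1), (M2 @ M3), ...
    reduced = matrices[:neven:2] @ matrices[1:neven:2]
    if len(matrices) > neven:  # len(matrices) is odd
        reduced[-1] = reduced[-1] @ matrices[-1]
    return reduce_split(reduced)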

time = timeit.timeit("reduce_loop(rotations)", number=10, globals=globals())
print(f"reduce_loop: {time * 1000:0.3}ms")

time = timeit.timeit("reduce_split(rotations)", number=10, globals=globals())
print(f"reduce_split: {time * 1000:0.3}ms")

This gives:

reduce_loop: 2.14e+02ms
reduce_split: 24.5ms

I'm sure it is not optimal, but it takes advantage of numpy's (and possibly cupy's) optimizations.
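
As a sanity check on the pairwise idea and the affine extension mentioned above, here is a minimal self-contained sketch. The names reduce_pairwise and to_homogeneous are hypothetical helpers introduced only for this illustration, and the random orthogonal matrices merely stand in for real rotations. It assumes affine transforms are stored as 4x4 homogeneous matrices, so that composition is still a plain matrix product.

```python
from functools import reduce
import numpy as np

def reduce_pairwise(matrices):
    """Ordered product of a stack of shape (n, k, k), pairing neighbours each pass."""
    matrices = np.asarray(matrices)
    while len(matrices) > 1:
        neven = (len(matrices) // 2) * 2
        # One batched matmul computes (M0 @ M1), (M2 @ M3), ...
        reduced = matrices[:neven:2] @ matrices[1:neven:2]
        if len(matrices) > neven:  # odd count: fold the leftover into the last product
            reduced[-1] = reduced[-1] @ matrices[-1]
        matrices = reduced
    return matrices[0]

def to_homogeneous(linear, translations):
    """Pack (n, 3, 3) linear parts and (n, 3) translations into (n, 4, 4) matrices."""
    n = len(linear)
    affine = np.tile(np.eye(4), (n, 1, 1))
    affine[:, :3, :3] = linear
    affine[:, :3, 3] = translations
    return affine

rng = np.random.default_rng(0)
# Random orthogonal 3x3 matrices as stand-ins for rotations (odd count on purpose).
mats = np.stack([np.linalg.qr(rng.normal(size=(3, 3)))[0] for _ in range(101)])
affine = to_homogeneous(mats, rng.random((101, 3)))

# Both comparisons check the pairwise result against a plain left-to-right reduce.
print(np.allclose(reduce_pairwise(mats), reduce(np.matmul, mats)))
print(np.allclose(reduce_pairwise(affine), reduce(np.matmul, affine)))
```

Because matrix multiplication is associative, regrouping the product into neighbouring pairs does not change the result as long as the left-to-right order is kept, which is what the strided slices do.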

1 Answer:

Answer 0: (score: 1)

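  1. reduce() was removed from Python's builtins (in Python 3 it lives only in functools) because it was considered inefficient and un-Pythonic. There is no cuPy equivalent, only the host version in the functools library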

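  2. Your cuPy code spends most of its time inefficiently copying data from host to device and back again, thousands of times, because reduce() runs only on the host, not on the GPU. You are exercising the PCI bus, not the GPU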

  3. Consider converting the list "rotations" into a single cuPy matrix and then using striding (instead of a Python list)

  4. Use a cuPy reduction kernel to perform the matmul https://docs.cupy.dev/en/stable/reference/generated/cupy.ReductionKernel.html
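
Point 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not a benchmarked GPU implementation: chain_matmul and its xp parameter are hypothetical names introduced here, the function defaults to numpy so it runs anywhere, and passing xp=cupy is the assumed GPU path. It uses batched matmul over strided views rather than a ReductionKernel.

```python
from functools import reduce
import numpy as np

def chain_matmul(matrices, xp=np):
    """Ordered product of a list of small square matrices using batched matmul.

    xp is the array module: numpy by default; passing cupy is the assumed GPU path.
    """
    # Stack on the host first, then hand the whole stack to xp in one call
    # (with cupy this is a single host-to-device transfer).
    stack = xp.asarray(np.stack(matrices))
    while stack.shape[0] > 1:
        if stack.shape[0] % 2:  # odd count: pad with an identity so pairs line up
            eye = xp.eye(stack.shape[-1], dtype=stack.dtype)[None]
            stack = xp.concatenate([stack, eye])
        # Strided views select even and odd entries; one batched matmul
        # multiplies every neighbouring pair while preserving the order.
        stack = xp.matmul(stack[0::2], stack[1::2])
    return stack[0]

# Small host-side check against a plain left-to-right reduce.
rng = np.random.default_rng(1)
sample = [np.linalg.qr(rng.normal(size=(3, 3)))[0] for _ in range(37)]
print(np.allclose(chain_matmul(sample), reduce(np.matmul, sample)))
```

For the 10000 matrices in the question this issues 14 batched matmul calls from Python instead of 9999 separate ones, and the data is handed to the array module once rather than matrix by matrix. With cupy, the call would look like chain_matmul(rotations, xp=cp), followed by cp.asnumpy() on the result if it is needed on the host.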