Question

I need to compute the composition of many 3x3 rotation matrices.

Here is a comparison of functools.reduce with matmul, using numpy versus cupy:

import timeit
from functools import reduce
import numpy as np
import cupy as cp
from pyrr.matrix33 import create_from_axis_rotation

# generate random rotation matrices
axes = np.random.rand(10000, 3)
angles = np.pi * np.random.rand(10000)
rotations = [create_from_axis_rotation(*params) for params in zip(axes, angles)]

# then reduce with matmul

xp = np # numpy
xp_rotations = [xp.asarray(rotation) for rotation in rotations]
timexp = timeit.timeit("reduce(xp.matmul, xp_rotations)", number=10, globals=globals())
print(f"{xp.__name__}: {timexp * 1000:0.3}ms")

xp = cp # cupy
xp_rotations = [xp.asarray(rotation) for rotation in rotations]
timexp = timeit.timeit("reduce(xp.matmul, xp_rotations)", number=10, globals=globals())
print(f"{xp.__name__}: {timexp * 1000:0.3}ms")

On a decent machine with a Titan GPU, this gives:

numpy: 1.63e+02ms
cupy: 8.78e+02ms

For some reason, the GPU is much slower.

In any case, is there a way to compute this significantly faster?

Edit

I found a fairly simple solution that works for any chain of small linear transformations (and can easily be extended to affine transformations).


def reduce_loop(matrices):
    """ non-optimized reduce """
    mat = matrices[0]
    for _mat in matrices[1:]:
        mat = mat @ _mat
    return mat

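def reduce_split(matrices):
    """ reduce by multiplying pairs of matrices recursively """
    if isinstance(matrices, (list, tuple)):
        matrices = np.stack(matrices)  # batched @ below needs an (N, 3, 3) array, not a list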
    if len(matrices) == 1:
        return matrices[0]
    neven = (len(matrices) // 2) * 2
    reduced = matrices[:neven:2] @ matrices[1:neven:2]
    if len(matrices) > neven:  # len(matrices) is odd
        reduced[-1] = reduced[-1] @ matrices[-1]
    return reduce_split(reduced)

time = timeit.timeit("reduce_loop(rotations)", number=10, globals=globals())
print(f"reduce_loop: {time * 1000:0.3}ms")

time = timeit.timeit("reduce_split(rotations)", number=10, globals=globals())
print(f"reduce_split: {time * 1000:0.3}ms")

This gives:

reduce_loop: 2.14e+02ms
reduce_split: 24.5ms

I'm sure it isn't optimal, but it takes advantage of numpy's (and probably cupy's) optimizations.
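The extension to affine transformations mentioned above could look roughly like the following. This is only a sketch, not something from the original post: it assumes each map x -> A @ x + t is packed into a 4x4 homogeneous matrix, and the names to_homogeneous and reduce_pairwise, as well as the random sample data, are made up for illustration.

```python
import numpy as np

def to_homogeneous(linear, translation):
    """Pack N maps x -> A @ x + t into an (N, 4, 4) array of homogeneous matrices."""
    n = len(linear)
    mats = np.zeros((n, 4, 4))
    mats[:, :3, :3] = linear       # linear part A
    mats[:, :3, 3] = translation   # translation t
    mats[:, 3, 3] = 1.0            # bottom row is [0, 0, 0, 1]
    return mats

def reduce_pairwise(matrices):
    """Multiply neighbouring pairs until a single matrix remains."""
    while len(matrices) > 1:
        neven = (len(matrices) // 2) * 2
        reduced = matrices[:neven:2] @ matrices[1:neven:2]
        if len(matrices) > neven:  # odd count: fold the last matrix into the last pair
            reduced[-1] = reduced[-1] @ matrices[-1]
        matrices = reduced
    return matrices[0]

rng = np.random.default_rng(0)
linear = rng.normal(size=(9, 3, 3))    # arbitrary sample linear parts (assumption)
translation = rng.normal(size=(9, 3))  # arbitrary sample translations (assumption)
composed = reduce_pairwise(to_homogeneous(linear, translation))

# Reference: apply the maps one at a time, rightmost first.
point = np.array([1.0, 2.0, 3.0])
expected = point
for A, t in zip(linear[::-1], translation[::-1]):
    expected = A @ expected + t

print(np.allclose(composed[:3, :3] @ point + composed[:3, 3], expected))
```

Because matrix multiplication is associative, grouping the products in pairs changes only the rounding, not the result, which is why the same reduction carries over from 3x3 to 4x4 matrices unchanged.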

Answer 1

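reduce() was removed from Python's built-ins (in Python 3 it lives only in the functools module) because it was considered inefficient and un-Pythonic. There is no cuPy equivalent, only the host version in the functools library.
Your cuPy code spends most of its time inefficiently copying data from the host to the device and back again, thousands of times, because reduce() runs only on the host, not on the GPU. You are stressing the PCI bus, not the GPU.
Consider turning the list "rotations" into a single cuPy matrix and then using strides (rather than a Python list).
Use a cuPy reduction kernel to do the matmul: https://docs.cupy.dev/en/stable/reference/generated/cupy.ReductionKernel.html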
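One possible reading of the suggestions above, as a hedged sketch: stack all rotations into one (N, 3, 3) array so it can be moved to the device in a single transfer, then let each batched matmul call handle half of the remaining products. Note that this is a pairwise tree reduction, not the cupy.ReductionKernel approach from the link. The names reduce_stacked and random_rotations are hypothetical, the sample data is generated with Rodrigues' formula instead of pyrr, and the code is only exercised with numpy here; the cupy call in the final comment is an untested assumption.

```python
from functools import reduce
import numpy as np

def random_rotations(n, rng):
    """Hypothetical helper: n random rotation matrices via Rodrigues' formula."""
    axes = rng.normal(size=(n, 3))
    axes /= np.linalg.norm(axes, axis=1, keepdims=True)
    angles = rng.uniform(0.0, np.pi, size=n)
    x, y, z = axes.T
    zero = np.zeros(n)
    # Cross-product (skew-symmetric) matrix K of each unit axis.
    K = np.stack([zero, -z, y, z, zero, -x, -y, x, zero], axis=1).reshape(n, 3, 3)
    s = np.sin(angles)[:, None, None]
    c = np.cos(angles)[:, None, None]
    return np.eye(3) + s * K + (1.0 - c) * (K @ K)

def reduce_stacked(stacked, xp=np):
    """Tree-reduce an (N, 3, 3) array with batched matmul; xp is numpy or cupy."""
    while stacked.shape[0] > 1:
        n = stacked.shape[0]
        neven = (n // 2) * 2
        # One batched call multiplies all neighbouring pairs at once.
        reduced = xp.matmul(stacked[0:neven:2], stacked[1:neven:2])
        if n > neven:
            # Odd count: keep the unpaired last matrix at the end so order is preserved.
            reduced = xp.concatenate([reduced, stacked[neven:]], axis=0)
        stacked = reduced
    return stacked[0]

rng = np.random.default_rng(0)
rotations = random_rotations(1001, rng)          # odd count exercises the leftover branch
result = reduce_stacked(rotations)               # about log2(N) batched calls
reference = reduce(np.matmul, list(rotations))   # N - 1 sequential calls
print(np.allclose(result, reference))

# With cupy installed and a GPU available (assumption, not run here):
#   import cupy as cp
#   result_gpu = reduce_stacked(cp.asarray(rotations), xp=cp)  # one host-to-device copy
```

The point of the design is that the number of Python-level calls drops from N - 1 to roughly log2(N), so per-call overhead stops dominating regardless of which array library does the work.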

Fastest way to compute many 3x3 matrix-matrix multiplications