Question

I need to multiply three matrices: A: 3000x100, B: 100x100, C: 100x3.6MM (3.6 million columns). At the moment I just use plain matrix multiplication in PyTorch:

A_gpu = torch.from_numpy(A)
B_gpu = torch.from_numpy(B)
C_gpu = torch.from_numpy(C)
D_gpu = (A_gpu @ B_gpu @ C_gpu.t()).t()

C is very wide, so data reuse on the GPU is limited, but is there any other way to speed this up? I have a machine with 4 GPUs.

Answer 1

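Depending on what your matrix C looks like, a sparse representation may reduce both its size and the computation time, e.g. if you store only the columns that are not all zero. The torch reference may help here.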
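To make the "store only the non-zero columns" idea concrete, here is a minimal sketch. It assumes C really does contain many all-zero columns, which the question does not state; the helper name `matmul_skip_zero_columns` and the small stand-in sizes are made up for illustration.

```python
import torch

def matmul_skip_zero_columns(AB, C):
    """Compute AB @ C, multiplying only the columns of C that have a non-zero entry."""
    # Boolean mask over columns: True where the column has at least one non-zero value.
    nonzero_cols = (C != 0).any(dim=0)
    # All-zero columns of C give all-zero output columns, so start from zeros.
    D = torch.zeros(AB.shape[0], C.shape[1], dtype=AB.dtype, device=AB.device)
    D[:, nonzero_cols] = AB @ C[:, nonzero_cols]
    return D

# Small stand-in sizes (assumption); the real C is 100 x 3.6M.
torch.manual_seed(0)
AB = torch.randn(30, 10)
C = torch.randn(10, 1000)
C[:, torch.rand(1000) < 0.8] = 0.0  # zero out roughly 80% of the columns

D = matmul_skip_zero_columns(AB, C)
```

Whether this pays off depends entirely on how many columns can be skipped; for a dense C it only adds overhead.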

Answer 2

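If you have multiple GPUs, you can distribute the computation across all of them using PyTorch's DataParallel. It will split (parallelize) the multiplication over the columns of the matrix C_gpu across the GPUs.

Here's how:

First, import the modules and prepare the matrices: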

import torch
import torch.nn as nn

A_gpu = torch.from_numpy(A).float()
B_gpu = torch.from_numpy(B).float()
C_gpu = torch.from_numpy(C).float()

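Next, create a Linear "layer" without bias. What this layer does is exactly a matrix multiplication. The input size is the size of each column of C_gpu, and the output size is the size of each column of the result.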

mat_mult = nn.Linear(in_features=C_gpu.shape[0],out_features=A_gpu.shape[0],bias=False)

Set the layer's matrix (= weight) to A_gpu @ B_gpu. This is a small matrix that can be computed quickly without parallelization (although you could parallelize it as well).

mat_mult.weight.data = A_gpu @ B_gpu

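Convert the layer to a DataParallel instance. This means it will automatically parallelize the computation along the "batch" dimension. The argument device_ids is a list of GPU indices (four of them, in your case).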

mat_mult_gpu = nn.DataParallel(mat_mult,device_ids=[0,1,2,3]).to('cuda:0')

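Now you can feed the matrix C_gpu into the layer, and the computation will be parallelized along its large dimension:

# D_gpu has shape (3.6M, 3000), i.e. the transpose of A @ B @ C
D_gpu = mat_mult_gpu(C_gpu.t())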
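As a sanity check on the trick itself, the following sketch confirms on the CPU, with small stand-in sizes, that a bias-free Linear layer whose weight is A @ B reproduces the transposed product. It does not use DataParallel or any GPU, so it says nothing about speed or memory; the sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
A = torch.randn(30, 10)   # stand-in for the 3000 x 100 matrix
B = torch.randn(10, 10)   # stand-in for the 100 x 100 matrix
C = torch.randn(10, 500)  # stand-in for the 100 x 3.6M matrix

layer = nn.Linear(in_features=C.shape[0], out_features=A.shape[0], bias=False)
with torch.no_grad():
    # nn.Linear stores its weight with shape (out_features, in_features).
    layer.weight.copy_(A @ B)
    # The layer computes input @ weight.t(), i.e. C.t() @ (A @ B).t().
    out = layer(C.t())

expected = (A @ B @ C).t()
print(out.shape)
print(torch.allclose(out, expected, atol=1e-4))
```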

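Important note: when writing this answer, I did not have access to multiple GPUs to actually test this proposed solution. I would appreciate it if any reader could confirm that it indeed works (or, even better, time it and compare it against a single GPU).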

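EDIT1: I have now tried this code on multiple GPUs (four Nvidia Tesla P100), and it fails with an out-of-memory error. I'll keep this solution here as a reference, since it does work for sizes up to about 400K (instead of 3.6M).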

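Also, this solution still works for the 3.6M size if you split C into smaller chunks, feed each chunk into mat_mult_gpu, and then concatenate the results on the CPU. Note that this needs a lot of CPU memory, since the result is 3K-by-3.6M, which takes about 40GB in fp32. (Alternatively, you can save each chunk to disk instead of concatenating the chunks.)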
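The chunking idea could look roughly like the sketch below. For brevity it uses a plain `@` on a single device rather than the DataParallel layer, and it falls back to the CPU when no GPU is present. The helper name `chunked_matmul_to_cpu`, the chunk size, and the small stand-in matrices are assumptions for illustration only.

```python
import torch

def chunked_matmul_to_cpu(AB, C, chunk_cols, device):
    """Compute AB @ C one column chunk at a time, collecting the result on the CPU."""
    AB_dev = AB.to(device)
    # Preallocate the full result on the CPU, so no extra concatenation copy is needed.
    D = torch.empty(AB.shape[0], C.shape[1], dtype=AB.dtype)
    for start in range(0, C.shape[1], chunk_cols):
        stop = min(start + chunk_cols, C.shape[1])
        # Only this chunk of C and its result are on the device at any one time.
        D[:, start:stop] = (AB_dev @ C[:, start:stop].to(device)).cpu()
    return D

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
AB = torch.randn(30, 10)
C = torch.randn(10, 1003)  # deliberately not a multiple of chunk_cols
D = chunked_matmul_to_cpu(AB, C, chunk_cols=250, device=device)
```

To write each chunk to disk instead, the assignment into `D` could be replaced by something like `torch.save` on the chunk result, at the cost of having to reassemble the pieces later.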

Answer 3

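Since you have four GPUs, you can use them to perform the matrix multiplication efficiently. Note, however, that the result of the multiplication has size 3000x3600000, which takes about 40GB in single-precision floating point (fp32). Unless you have enough CPU RAM, you cannot store the result of the computation in RAM.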
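The size estimate is easy to reproduce; this short calculation only assumes 4 bytes per fp32 element and an even split over four GPUs:

```python
rows, cols = 3000, 3_600_000
bytes_per_fp32 = 4

total_bytes = rows * cols * bytes_per_fp32
total_gb = total_bytes / 1e9      # decimal gigabytes
total_gib = total_bytes / 2**30   # binary gibibytes
per_gpu_gb = total_gb / 4         # one quarter of the result per GPU

print(total_gb)              # 43.2
print(round(total_gib, 1))   # 40.2
print(per_gpu_gb)            # 10.8
```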

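A possible solution is to split the large matrix C into four smaller chunks, perform the matrix multiplication of each chunk on a different GPU, and keep the results on the GPUs. Each GPU then has to hold a result chunk of about 10.8GB (a quarter of the total) plus its chunk of C, so you need somewhat more than 10GB of memory per GPU.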

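If you do have enough CPU memory as well, you can move the results of all four GPUs to the CPU and concatenate them (in fact, in that case you could have used just one GPU and transferred the result from the GPU to the CPU after each chunk). Otherwise, you can keep the results on the GPUs, and you will need to keep track of the fact that the four chunks are actually parts of one matrix.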

import numpy as np
import torch.nn as nn
import torch

number_of_gpus = 4

# create three matrices
A = np.random.normal(size=(3000,100))
B = np.random.normal(size=(100,100))
C = np.random.normal(size=(100,3600000))

# convert them to pytorch fp32 tensors
A = torch.from_numpy(A).float()
B = torch.from_numpy(B).float()
C = torch.from_numpy(C).float()

# calculate `A@B`, which is easy
AB = A@B

# split the large matrix `C` into 4 smaller chunks along the second dimension. 
# we assume here that the size of the second dimension of `C` is divisible by 4.  
C_split = torch.split(C,C.shape[1]//number_of_gpus,dim=1)

# loop over the four GPUs, and perform the calculation on each using the corresponding chunk of `C`
D_split = []
for i in range(number_of_gpus):
    device = 'cuda:{:d}'.format(i)
    D_split.append(AB.to(device) @ C_split[i].to(device))

# DO THIS ONLY IF YOU HAVE ENOUGH CPU MEMORY!! :
D = torch.cat([d.cpu() for d in D_split],dim=1)
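The code above assumes that the number of columns of `C` is divisible by the number of GPUs. One way to drop that assumption is `torch.tensor_split`, which always returns the requested number of pieces even when the sizes are uneven. The sketch below is only an illustration: the helper name `split_matmul` is made up, the sizes are small stand-ins, and it falls back to four CPU "devices" when no GPU is available.

```python
import torch

def split_matmul(AB, C, devices):
    """Compute AB @ C in column chunks, one chunk per device; returns a list of results."""
    # tensor_split yields exactly len(devices) pieces, even for uneven sizes.
    parts = torch.tensor_split(C, len(devices), dim=1)
    return [AB.to(dev) @ part.to(dev) for dev, part in zip(devices, parts)]

n_gpus = torch.cuda.device_count()
# Fall back to four CPU "devices" so the sketch also runs without any GPU.
devices = [f"cuda:{i}" for i in range(n_gpus)] if n_gpus else ["cpu"] * 4

torch.manual_seed(0)
AB = torch.randn(30, 10)
C = torch.randn(10, 1003)  # 1003 is not divisible by 4

D_split = split_matmul(AB, C, devices)
# As above: only concatenate on the CPU if there is enough CPU memory.
D = torch.cat([d.cpu() for d in D_split], dim=1)
```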

Efficient way to do matrix multiplication when one matrix is very wide?
