Question

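I am training a neural network with two hidden layers of 100 nodes each (100*100), four inputs and one output, and a batch size of 32, and I see no speedup when using the GPU instead of the CPU. My dataset is limited (1067 samples, all copied to the GPU up front), but I assumed the 33 batches could run in parallel, which would more than make up for the time spent copying to the GPU. Is my dataset too small, or is there some other underlying problem? Here is my code snippet: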

import torch

def train_for_regression(X, T):
    BATCH_SIZE = 32
    n_epochs = 1000
    learning_rate = 0.01
    device = torch.device("cuda:0")
    Xt = torch.from_numpy(X).float().to(device)  # Training inputs: 1067 samples x 4 features
    Tt = torch.from_numpy(T).float().to(device)  # Training targets: 1067 samples x 1 output
    
    nnet = torch.nn.Sequential(torch.nn.Linear(4, 100), 
                               torch.nn.Tanh(), 
                               torch.nn.Linear(100, 100), 
                               torch.nn.Tanh(),
                               torch.nn.Linear(100, 1))
    nnet.to(device)
    mse_f = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(nnet.parameters(), lr=learning_rate)

    for epoch in range(n_epochs):
        for i in range(0, len(Xt), BATCH_SIZE):
            batch_Xt = Xt[i:i+BATCH_SIZE,:]
            batch_Tt = Tt[i:i+BATCH_SIZE,:]
            optimizer.zero_grad()
            Y = nnet(batch_Xt)
            mse = mse_f(Y, batch_Tt)
            mse.backward()
            optimizer.step()
    return nnet

Answer 1

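The time it takes for the data to reach the GPU will most likely cancel out the GPU's advantage. In this case the network is so small that the CPU should already be efficient enough, and the speedup from the GPU should not be that large.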
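To put a rough number on "small", here is a minimal back-of-the-envelope sketch using only the layer sizes, sample count, and batch size from the question. The helper names (`count_parameters`, `forward_multiply_adds`) are made up for illustration, and the count covers only the matrix products of the forward pass, not the activations or the backward pass.

```python
import math

# Layer sizes from the question: 4 inputs -> 100 -> 100 -> 1 output.
LAYER_SIZES = [4, 100, 100, 1]
N_SAMPLES = 1067
BATCH_SIZE = 32


def count_parameters(sizes):
    """Weights plus biases over consecutive fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def forward_multiply_adds(sizes, batch_size):
    """Multiply-adds in the matrix products of one forward pass."""
    return batch_size * sum(n_in * n_out for n_in, n_out in zip(sizes, sizes[1:]))


n_params = count_parameters(LAYER_SIZES)
macs_per_batch = forward_multiply_adds(LAYER_SIZES, BATCH_SIZE)
steps_per_epoch = math.ceil(N_SAMPLES / BATCH_SIZE)

print(n_params, macs_per_batch, steps_per_epoch)
```

That works out to about ten thousand parameters and a few hundred thousand multiply-adds per batch, repeated over 34 steps per epoch (33 full batches plus one partial batch of 11 samples). With that little arithmetic per step, fixed per-step overhead can easily dominate the run time.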

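Also, a GPU is typically used to parallelize matrix computations, which in this case means multiplying a single batch of data by the network's weights. So the batches will not be processed in parallel unless you take extra steps (such as using additional libraries and/or GPUs).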
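One way to check this on your own machine is to time a single epoch per device and per batch size. The sketch below is only an illustration: it uses random data with the same shapes as in the question, the names `build_net` and `time_one_epoch` are hypothetical, and a full-batch run of 1067 samples is included purely to show the effect of doing fewer, larger steps (it changes the optimization behavior, so it is not a drop-in replacement). CUDA kernels run asynchronously, so the code calls `torch.cuda.synchronize` before reading the clock.

```python
import time

import torch


def build_net():
    # Same architecture as in the question: 4 -> 100 -> 100 -> 1 with tanh.
    return torch.nn.Sequential(
        torch.nn.Linear(4, 100),
        torch.nn.Tanh(),
        torch.nn.Linear(100, 100),
        torch.nn.Tanh(),
        torch.nn.Linear(100, 1),
    )


def time_one_epoch(device, batch_size, n_samples=1067):
    """Run one epoch on random data; return (optimizer steps, seconds)."""
    X = torch.randn(n_samples, 4, device=device)
    T = torch.randn(n_samples, 1, device=device)
    nnet = build_net().to(device)
    mse_f = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(nnet.parameters(), lr=0.01)

    if device.type == "cuda":
        torch.cuda.synchronize(device)  # finish pending GPU work before timing
    start = time.perf_counter()
    steps = 0
    for i in range(0, n_samples, batch_size):
        optimizer.zero_grad()
        mse = mse_f(nnet(X[i:i + batch_size]), T[i:i + batch_size])
        mse.backward()
        optimizer.step()
        steps += 1
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # wait for queued kernels to complete
    return steps, time.perf_counter() - start


if __name__ == "__main__":
    devices = [torch.device("cpu")]
    if torch.cuda.is_available():
        devices.append(torch.device("cuda:0"))
    for device in devices:
        for batch_size in (32, 1067):
            steps, seconds = time_one_epoch(device, batch_size)
            print(f"{device} batch={batch_size}: {steps} steps, {seconds:.4f} s")
```

The first call on a GPU also pays one-off initialization costs, so it is worth running the measurement a few times and ignoring the first result. The batches are processed one after another on either device; only the work inside each step is parallelized.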

GPU showing no speedup over CPU
