Suppose I have a model (mostly pseudocode):
class SomeLayer(nn.Module):
    def __init__(self, s):
        # init some layers etc.
        self.N = s * s

    def forward(self, input_tensor):
        # initialize some variables
        some_results = []
        for iter_i in range(self.N):
            # do independent operations on different parts of input_tensor;
            # each operation is basically a copy of a subtensor of
            # input_tensor whose size depends on iter_i,
            # then append the result to some_results
        return some_results
What is the right way to parallelize this kind of for loop? Right now I'm planning to write a small CUDA kernel for this and load it from Python, but that feels like overkill; I assume there should be a simple way to do it, though I couldn't find one in the docs.
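For context on what I have tried so far: since the subtensor size depends on `iter_i`, the usual trick is to pad or mask the variable-sized copies to a common shape so the whole loop collapses into one batched tensor op. Below is a minimal sketch of that idea; the prefix-slice rule (`x[:i+1]`) and the `sum` reduction are hypothetical stand-ins, since the question does not show the actual per-iteration operation.

```python
import torch

def batched_prefix_sums(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the loop above: iteration i copies the prefix x[:i+1]
    # and reduces it. A boolean mask turns the N variable-sized copies
    # into one (N, L) batched operation, so a single kernel runs instead
    # of N small ones launched from a Python loop.
    L = x.shape[0]
    N = L  # one "iteration" per prefix length, for illustration
    idx = torch.arange(L, device=x.device)
    # mask[i, j] is True where j <= i, i.e. row i selects the prefix x[:i+1]
    mask = idx.unsqueeze(0) <= torch.arange(N, device=x.device).unsqueeze(1)
    return (x.unsqueeze(0) * mask).sum(dim=1)  # row i == x[:i+1].sum()

x = torch.arange(4.0)
print(batched_prefix_sums(x))  # tensor([0., 1., 3., 6.])
```

If the per-iteration ops cannot be expressed as one masked/padded batch, another option (on GPU) is that kernel launches are asynchronous anyway, so independent slices issued in a loop can already overlap; `torch.cuda.Stream` gives explicit control over that without writing a custom kernel.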