I'm very new to training neural networks, but foolishly attempted to implement a novel architecture of my own. It's quite similar to a Transformer: the pipeline takes a tensor of commodity headlines, passes it through the encoder half of a Transformer, and then through several transformations and variants of that Transformer encoder. The outputs are then summed into a single vector and softmaxed. I also wrote a custom loss function.
When I plot the average gradient at each layer, it looks like this.
(The last layer's gradients are huge and the rest are nearly nonexistent. This is with the gradients clipped to 0.25; before clipping, the gradient at the last layer was around 1e7.)
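A minimal sketch of how these per-layer means can be collected (the clip_grad_value_ call stands in for whatever clipping is actually used to get to 0.25):

import torch

def mean_grad_per_layer(model):
    # Mean absolute gradient of each parameter tensor, keyed by name;
    # called after loss.backward() and before optimizer.step().
    return {name: p.grad.abs().mean().item()
            for name, p in model.named_parameters()
            if p.grad is not None}

# Clipping the gradients to 0.25:
# torch.nn.utils.clip_grad_value_(model.parameters(), 0.25)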
Here is the relevant code:
import torch
from torch.autograd import Variable

def forward(self, example):
    art_mask = Variable((example.Art[:,:,0] != 0).unsqueeze(-2)).cuda()
    Art = Variable(example.Art).cuda()
    # To give context to article embeddings, pass through Transformer Encoder block
    Art = self.encoder_block(self.position_layer(Art), art_mask).cuda()
    Tsf = example.Tsf.cuda().repeat((1, Art.shape[1], 1)).cuda()
    # Concatenate Tsf to Art
    Art = torch.cat((Art, Tsf), dim=2).cuda()
    # Convert Art to Ent and construct ent_mask
    Ent = Art[example.EntArt[:,:,0], example.EntArt[:,:,1], :].cuda()
    ent_mask = (example.EntArt[:,:,0] == -1).unsqueeze(-2).cuda()
    # Pass to graph block, alternating layers of Relational Attn and Entity Self Attn
    Ent = self.graph_block(Ent, ent_mask, example.RelEnt).cuda()
    # Slice and reorder Ent into assets tensor
    A = len(self.assets_list)
    Ass = torch.full((A, 1), -1, dtype=torch.long).cuda()
    for i, uri in enumerate(self.assets_list):
        if uri in example.AssetIndex:
            Ass[i] = example.AssetIndex[uri]
    Assets = Ent[Ass, :, :].squeeze(1).cuda()
    # Mask out assets absent from this example with a large negative fill
    mask = Ass.unsqueeze(2).repeat(1, Assets.shape[1], Assets.shape[2]).cuda()
    Assets = Assets.masked_fill(mask == -1, -1e9).cuda()
    Assets = Assets.sum(dim=1).squeeze(1)
    Assets = torch.matmul(Assets, self.W).cuda()
    # Append a constant zero logit for the safe asset before the softmax
    bias = torch.zeros((1)).cuda()
    Assets = torch.cat((Assets, bias)).cuda()
    return self.softmax(Assets)
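So the forward pass returns a softmax-normalized weight vector over the A assets plus the appended safe-asset slot. A toy illustration of just that final step, with made-up scores:

import torch

logits = torch.tensor([0.3, -1.2, 0.8])       # per-asset scores (A = 3 here)
logits = torch.cat((logits, torch.zeros(1)))  # zero logit appended for the safe asset
weights = torch.softmax(logits, dim=0)        # portfolio weights; they sum to 1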
"""
prices[i] is normalized closing / opening
:param prices <torch.Tensor(batch_size, len(assets))>
"""
def loss_f(model, XY):
examples, prices = XY
portfolios = torch.stack([model.forward(ex) for ex in examples], dim=0)
prices = Variable(prices)
# safe asset (US Dollars) at prices[:,-1]
prices = torch.cat((prices, torch.ones((prices.shape[0], 1), dtype=torch.float)), dim = 1).cuda()
return -torch.sum(portfolios * prices) / 4 # batch size
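In other words, the loss is the negative mean portfolio return over the batch. A tiny worked case with made-up numbers for a single example:

import torch

w = torch.tensor([0.5, 0.3, 0.2])     # portfolio weights from the softmax
p = torch.tensor([1.02, 0.98, 1.00])  # normalized closing / opening prices
loss = -(w * p).sum()                 # -1.004; minimizing this maximizes return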
The parameters with the large gradients are the weight and bias of the LayerNorm layer at the end of the Transformer-style architecture.
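(Those were isolated by filtering the gradient dump on parameter names; a sketch, assuming the LayerNorm modules are registered with 'norm' somewhere in their names:)

for name, p in model.named_parameters():
    # LayerNorm weight/bias gradients, if 'norm' appears in the registered name
    if p.grad is not None and 'norm' in name.lower():
        print(name, p.grad.abs().mean().item(), p.grad.abs().max().item())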