RuntimeError:CUDA运行时错误(710):在

时间:2019-12-14 00:28:06

标签: deep-learning pytorch

  

使用pytorch转换图像分类
得到了以下结果   错误消息K


  

RuntimeError跟踪(最近的调用)   最后)        29打印(len(train_loader.dataset),len(valid_loader.dataset))        30#休息   ---> 31 train_loss,train_acc,模型=火车(模型,设备,train_loader,优化器,标准)        32valid_loss,valid_acc,model =评估(模型,设备,valid_loader,条件)        33

     

in train(模型,设备,迭代器,   优化程序,条件)        21个acc = Calculation_accuracy(fx,y)        22 #print(“ 5。”)   ---> 23 loss.backward()        24        25 Optimizer.step()

     

〜/ venv / lib / python3.7 / site-packages / torch / tensor.py向后(自己,   渐变,retain_graph,create_graph)       164个商品。默认为False。       165“”“   -> 166 torch.autograd.backward(自己,渐变,retain_graph,create_graph)       167       168 def register_hook(self,hook):

     

〜/ venv / lib / python3.7 / site-packages / torch / autograd / init .py in   向后(张量,grad_tensors,retain_graph,create_graph,   grad_variables)        97变量。        98张量,grad_tensors,retain_graph,create_graph,   ---> 99 allow_unreachable = True)#allow_unreachable标志       100       101

     

RuntimeError:CUDA运行时错误(710):触发了设备端断言   在/pytorch/aten/src/THC/generic/THCTensorMath.cu:26

相关代码块在这里

def train(model, device, iterator, optimizer, criterion):

print('train')
epoch_loss = 0
epoch_acc = 0

model.train()


for (x, y) in iterator:
    #print(x,y)
    x,y = x.cuda(), y.cuda()
    #x = x.to(device)
    #y = y.to(device)
    #print('1')
    optimizer.zero_grad()
    #print('2')
    fx = model(x)
    #print('3')
    loss = criterion(fx, y)
    #print("4.loss->",loss)
    acc = calculate_accuracy(fx, y)
    #print("5.")
    loss.backward()

    optimizer.step()

    epoch_loss += loss.item()
    epoch_acc += acc.item()

return epoch_loss / len(iterator), epoch_acc / len(iterator),model


    EPOCHS = 5
    SAVE_DIR = 'models'
    MODEL_SAVE_PATH = os.path.join(SAVE_DIR, 'please.pt')
    from torch.utils.data import DataLoader
    best_valid_loss = float('inf')

    if not os.path.isdir(f'{SAVE_DIR}'):
        os.makedirs(f'{SAVE_DIR}')
    print("start")
    for epoch in range(EPOCHS):
        print('================================',epoch ,'================================')
        for i , (train_idx, valid_idx) in enumerate(zip(train_indexes, valid_indexes)):
            print(i,train_idx,valid_idx,len(train_idx),len(valid_idx))

            traindf = df_train.iloc[train_index, :].reset_index()
            validdf = df_train.iloc[valid_index, :].reset_index()

            #traindf = df_train
            #validdf = df_train

            train_dataset = TrainDataset(traindf, mode='train', transforms=data_transforms)
            valid_dataset = TrainDataset(validdf, mode='valid', transforms=data_transforms)

            train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
            valid_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False)



            print(len(train_loader.dataset),len(valid_loader.dataset))
            #break
            train_loss, train_acc ,model= train(model, device, train_loader, optimizer, criterion)
            valid_loss, valid_acc,model = evaluate(model, device, valid_loader, criterion)

            if valid_loss < best_valid_loss:
                best_valid_loss = valid_loss
                torch.save(model,MODEL_SAVE_PATH)

            print(f'| Epoch: {epoch+1:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:05.2f}% | Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:05.2f}% |')
  

拆分= zip(train_indexes,valid_indexes)   [3692 3696 3703 ... 30733 30734 30735] [0 1 2 ... 4028   4041 4046] [0 1 2 ... 30733 30734 30735] [3692 3696 3703   ... 7986 7991 8005] [0 1 2 ... 30733 30734 30735] [7499   7500 7502 ... 11856 11858 11860] [0 1 2 ... 30733 30734   30735] [11239 11274 11280 ... 15711 15716 15720] [0 1 2   ... 30733 30734 30735] [15045 15051 15053 ... 19448 19460 19474] [
  0 1 2 ... 30733 30734 30735] [18919 18920 18926 ... 23392   23400 23402] [0 1 2 ... 30733 30734 30735] [22831 22835   22846 ... 27118 27120 27124] [0 1 2 ... 27118 27120 27124]   [26718 26721 26728 ... 30733 30734 30735]

1 个答案:

答案 0 :(得分:0)

您的损失函数是什么?

我也收到此错误。我的问题是multi-class分类,我损失了crossEntropy

正如在documentations中所说,标签应在[0, C-1]范围内,其中C是类数。 但是我的标签不在此范围内,当我为标签使用适当的值时,一切正常。