Cross-validation on the MNIST dataset with PyTorch and sklearn

Time: 2019-11-22 14:25:42

Tags: scikit-learn pytorch cross-validation mnist k-fold

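I am new to PyTorch and am trying to implement a feed-forward neural network to classify the MNIST dataset. I ran into some problems when trying to use cross-validation. My data has the following shapes: x_train: torch.Size([45000, 784]) and y_train: torch.Size([45000])

I tried to use KFold from sklearn.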

from sklearn.model_selection import KFold

kfold = KFold(n_splits=10)

Here is the first part of my training method, where I split the data into folds:

for train_index, test_index in kfold.split(x_train, y_train):
    x_train_fold = x_train[train_index]
    x_test_fold = x_test[test_index]
    y_train_fold = y_train[train_index]
    y_test_fold = y_test[test_index]
    print(x_train_fold.shape)
    for epoch in range(epochs):
        ...

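The indices for the y_train_fold variable are right, they are simply [ 0 1 2 ... 4497 4498 4499], but those for x_train_fold are not: [ 4500 4501 4502 ... 44997 44998 44999]. The same goes for the test folds.

For the first iteration I want the variable x_train_fold to be the first 4500 pictures, in other words to have the shape torch.Size([4500, 784]), but it has the shape torch.Size([40500, 784]).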

Any tips on how to solve this properly?

4 answers:

Answer 0: (score: 2)

You mixed up the indices.

x_train = x[train_index]
x_test = x[test_index]
y_train = y[train_index]
y_test = y[test_index]

x_fold = x_train[train_index]
y_fold = y_train[test_index]

It should be:

x_fold = x_train[train_index]
y_fold = y_train[train_index]
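
A minimal sketch of why the index arrays matter, using a toy array of 10 samples instead of MNIST (the toy data and the names `x` and `y` are made up for illustration). Indexing the features with `train_index` and the labels with `test_index` yields different numbers of rows and labels, so they can no longer be paired:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy stand-in for the data: 10 samples, 3 features; label i belongs to row i.
x = np.arange(30).reshape(10, 3)
y = np.arange(10)

kfold = KFold(n_splits=5)
train_index, test_index = next(kfold.split(x, y))

# Mismatched indices: the feature rows and the labels come from different samples.
x_bad, y_bad = x[train_index], y[test_index]
print(x_bad.shape, y_bad.shape)

# Matching indices: rows and labels stay paired.
x_fold, y_fold = x[train_index], y[train_index]
print(x_fold.shape, y_fold.shape)
```

With matching indices, every row of `x_fold` still lines up with its own label in `y_fold`.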

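Answer 1: (score: 1)

I think you are confused!

Ignore the second dimension for a moment. When you have 45000 points and use 10-fold cross-validation, what is the size of each fold? 45000/10, i.e. 4500.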

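This means each of your folds contains 4500 data points; one fold is used for testing and the rest for training, i.e.

  For testing: 1 fold => 4500 data points => size: 4500
  For training: remaining folds => 45000 - 4500 data points => size: 45000 - 4500 = 40500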

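So, for the first iteration, the first 4500 data points (by index) are used for testing and the rest for training. (Check the picture below.)

Given that your data is x_train: torch.Size([45000, 784]) and y_train: torch.Size([45000]), this is what your code should look like: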

for train_index, test_index in kfold.split(x_train, y_train):  
    print(train_index, test_index)

    x_train_fold = x_train[train_index] 
    y_train_fold = y_train[train_index] 
    x_test_fold = x_train[test_index] 
    y_test_fold = y_train[test_index] 

    print(x_train_fold.shape, y_train_fold.shape) 
    print(x_test_fold.shape, y_test_fold.shape) 
    break 

[ 4500  4501  4502 ... 44997 44998 44999] [   0    1    2 ... 4497 4498 4499]
torch.Size([40500, 784]) torch.Size([40500])
torch.Size([4500, 784]) torch.Size([4500])

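So, when you say

  I want the variable x_train_fold to be the first 4500 pictures... shape torch.Size([4500, 784]).

you are wrong. This size corresponds to x_test_fold. In the first iteration, with 10 folds, x_train_fold gets 40500 points, so its size should be torch.Size([40500, 784]).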
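
The sizes worked out above can be checked for all 10 folds, not just the first one. This is a small sketch that splits a placeholder array rather than the real MNIST tensors (the `indices` array is an assumption for illustration; `KFold.split` only needs to know the number of samples):

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 45000
indices = np.arange(n_samples).reshape(-1, 1)  # placeholder for x_train

kfold = KFold(n_splits=10)
fold_sizes = []
seen_in_test = []
for train_index, test_index in kfold.split(indices):
    fold_sizes.append((len(train_index), len(test_index)))
    seen_in_test.append(test_index)

# (train size, test size) of the first fold.
print(fold_sizes[0])

# Across all folds, count how many samples were used for testing
# and how many of those are distinct.
all_test = np.concatenate(seen_in_test)
print(len(all_test), len(np.unique(all_test)))
```

Each sample lands in exactly one test fold and in the training portion of the other nine.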

K-fold validation image

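Answer 2: (score: 0)

I think I have it now, but the code feels a bit messy with 3 nested loops. Is there a simpler way to do it, or is this approach okay?

Here is my training code with cross-validation: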

def train(network, epochs, save_Model = False):
    total_acc = 0
    for fold, (train_index, test_index) in enumerate(kfold.split(x_train, y_train)):
        ### Dividing data into folds
        x_train_fold = x_train[train_index]
        x_test_fold = x_train[test_index]
        y_train_fold = y_train[train_index]
        y_test_fold = y_train[test_index]

        train = torch.utils.data.TensorDataset(x_train_fold, y_train_fold)
        test = torch.utils.data.TensorDataset(x_test_fold, y_test_fold)
        train_loader = torch.utils.data.DataLoader(train, batch_size = batch_size, shuffle = False)
        test_loader = torch.utils.data.DataLoader(test, batch_size = batch_size, shuffle = False)

        for epoch in range(epochs):
            print('\nEpoch {} / {} \nFold number {} / {}'.format(epoch + 1, epochs, fold + 1 , kfold.get_n_splits()))
            correct = 0
            network.train()
            for batch_index, (x_batch, y_batch) in enumerate(train_loader):
                optimizer.zero_grad()
                out = network(x_batch)
                loss = loss_f(out, y_batch)
                loss.backward()
                optimizer.step()
                pred = torch.max(out.data, dim=1)[1]
                correct += (pred == y_batch).sum()
                if (batch_index + 1) % 32 == 0:
                    print('[{}/{} ({:.0f}%)]\tLoss: {:.6f}\t Accuracy:{:.3f}%'.format(
                        (batch_index + 1)*len(x_batch), len(train_loader.dataset),
                        100.*batch_index / len(train_loader), loss.data, float(correct*100) / float(batch_size*(batch_index+1))))
        total_acc += float(correct*100) / float(batch_size*(batch_index+1))
    total_acc = (total_acc / kfold.get_n_splits())
    print('\n\nTotal accuracy cross validation: {:.3f}%'.format(total_acc))

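Answer 3: (score: 0)

Although the answers above all give a good example of how to split the dataset, I am curious about the way the K-fold cross-validation itself is implemented. K-fold aims to estimate the skill of a machine learning model on unseen data: it uses a limited sample to estimate how the model is expected to perform in general when making predictions on data not used during training. (See the concept and explanation on Wikipedia: https://en.wikipedia.org/wiki/Cross-validation_(statistics)) It is therefore necessary to re-initialize the parameters of the model to be trained at the beginning of each fold. Otherwise, after K folds the model will have seen every sample in the dataset, and there is no validation left (all samples are training samples).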
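
One way to follow this advice is to build a fresh network and optimizer inside the fold loop and to score each fold on its held-out part. The sketch below is only an illustration under stated assumptions: the data is random noise standing in for MNIST, `make_model` is a hypothetical helper, and the architecture, learning rate, and epoch count are arbitrary.

```python
import torch
from torch import nn
from sklearn.model_selection import KFold

torch.manual_seed(0)

# Synthetic stand-in for the data (assumption): 200 samples, 784 features, 10 classes.
x_train = torch.randn(200, 784)
y_train = torch.randint(0, 10, (200,))


def make_model():
    # Hypothetical small feed-forward network; any architecture could go here.
    return nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 10))


kfold = KFold(n_splits=5)
loss_f = nn.CrossEntropyLoss()
fold_accuracies = []

for train_index, test_index in kfold.split(x_train):
    # Fresh model and optimizer per fold, so nothing learned in earlier
    # folds (which trained on this fold's test samples) carries over.
    network = make_model()
    optimizer = torch.optim.SGD(network.parameters(), lr=0.1)

    x_tr, y_tr = x_train[train_index], y_train[train_index]
    x_te, y_te = x_train[test_index], y_train[test_index]

    network.train()
    for epoch in range(3):
        optimizer.zero_grad()
        loss = loss_f(network(x_tr), y_tr)  # full-batch step, for brevity
        loss.backward()
        optimizer.step()

    # Score on the held-out fold only.
    network.eval()
    with torch.no_grad():
        pred = network(x_te).argmax(dim=1)
        fold_accuracies.append((pred == y_te).float().mean().item())

mean_acc = sum(fold_accuracies) / len(fold_accuracies)
print('Mean held-out accuracy: {:.3f}'.format(mean_acc))
```

On random labels the accuracy carries no meaning; the point is the structure, where the model is created inside the loop and the reported score comes from `test_index` samples that this fold's model never trained on.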