How do I split a custom dataset into training and test datasets?

Time: 2018-05-26 16:16:05

Tags: python deep-learning pytorch

import pandas as pd
import numpy as np
import cv2
from torch.utils.data.dataset import Dataset

class CustomDatasetFromCSV(Dataset):
    def __init__(self, csv_path, transform=None):
        self.data = pd.read_csv(csv_path)
        self.labels = pd.get_dummies(self.data['emotion']).to_numpy()  # as_matrix() was removed in newer pandas
        self.height = 48
        self.width = 48
        self.transform = transform

    def __getitem__(self, index):
        pixels = self.data['pixels'].tolist()
        faces = []
        for pixel_sequence in pixels:
            face = [int(pixel) for pixel in pixel_sequence.split(' ')]
            # print(np.asarray(face).shape)
            face = np.asarray(face).reshape(self.width, self.height)
            face = cv2.resize(face.astype('uint8'), (self.width, self.height))
            faces.append(face.astype('float32'))
        faces = np.asarray(faces)
        faces = np.expand_dims(faces, -1)
        return faces, self.labels

    def __len__(self):
        return len(self.data)

This is what I was able to put together using references from other repositories. However, I would like to split this dataset into train and test.

How can I do that inside this class? Or do I need a separate class to do it?

5 answers:

Answer 0 (score: 41)

Starting with PyTorch 0.4.1, you can use random_split:

import torch

train_size = int(0.8 * len(full_dataset))
test_size = len(full_dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(full_dataset, [train_size, test_size])
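A minimal end-to-end sketch, using a hypothetical `TensorDataset` as a stand-in for `full_dataset` (the `generator` argument for reproducible splits is available in more recent PyTorch releases):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

# Hypothetical stand-in dataset: 100 samples with 8 features each
full_dataset = TensorDataset(torch.randn(100, 8), torch.randint(0, 2, (100,)))

train_size = int(0.8 * len(full_dataset))
test_size = len(full_dataset) - train_size

# A seeded torch.Generator makes the split reproducible across runs
train_dataset, test_dataset = random_split(
    full_dataset, [train_size, test_size],
    generator=torch.Generator().manual_seed(42))

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)
print(len(train_dataset), len(test_dataset))  # 80 20
```

The returned objects are `Subset` views over the original dataset, so no data is copied.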

Answer 1 (score: 22)

Using PyTorch's SubsetRandomSampler:

import torch
import numpy as np
import pandas as pd
import cv2
from torch.utils.data import Dataset
from torch.utils.data.sampler import SubsetRandomSampler

class CustomDatasetFromCSV(Dataset):
    def __init__(self, csv_path, transform=None):
        self.data = pd.read_csv(csv_path)
        self.labels = pd.get_dummies(self.data['emotion']).to_numpy()  # as_matrix() was removed in newer pandas
        self.height = 48
        self.width = 48
        self.transform = transform

    def __getitem__(self, index):
        # This method should return only 1 sample and label 
        # (according to "index"), not the whole dataset
        # So probably something like this for you:
        pixel_sequence = self.data['pixels'][index]
        face = [int(pixel) for pixel in pixel_sequence.split(' ')]
        face = np.asarray(face).reshape(self.width, self.height)
        face = cv2.resize(face.astype('uint8'), (self.width, self.height))
        label = self.labels[index]

        return face, label

    def __len__(self):
        return len(self.labels)


dataset = CustomDatasetFromCSV(my_path)
batch_size = 16
validation_split = .2
shuffle_dataset = True
random_seed = 42

# Creating data indices for training and validation splits:
dataset_size = len(dataset)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
if shuffle_dataset:
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

# Creating PT data samplers and loaders:
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(val_indices)

train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, 
                                           sampler=train_sampler)
validation_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                                sampler=valid_sampler)

# Usage example:
num_epochs = 10
for epoch in range(num_epochs):
    # Train:
    for batch_index, (faces, labels) in enumerate(train_loader):
        ...  # training step goes here

Answer 2 (score: 9)

The current answers do a random split, which has the disadvantage that the number of samples per class is not guaranteed to be balanced. This is especially problematic when you want only a small number of samples per class. For example, MNIST has 60,000 examples, i.e. 6,000 per digit. Assume that you want only 30 examples per digit in your training set. In that case, a random split may produce an imbalance between classes (one digit with more training data than the others). So you want to make sure each digit gets exactly 30 labeled examples. This is called stratified sampling.

One way to do this is to use the sampler interface in PyTorch, and sample code is here.

Another way is to just hack your way through :). For example, below is a simple implementation for MNIST, where ds is the MNIST dataset and k is the number of samples needed per class.

import torch
from torch.utils.data import TensorDataset

def sampleFromClass(ds, k):
    class_counts = {}
    train_data = []
    train_label = []
    test_data = []
    test_label = []
    for data, label in ds:
        # MNIST targets may be plain ints or 0-dim tensors, depending on the torchvision version
        c = label if isinstance(label, int) else label.item()
        class_counts[c] = class_counts.get(c, 0) + 1
        if class_counts[c] <= k:
            train_data.append(data)
            train_label.append(torch.tensor([c]))
        else:
            test_data.append(data)
            test_label.append(torch.tensor([c]))
    train_data = torch.cat(train_data)
    train_label = torch.cat(train_label)
    test_data = torch.cat(test_data)
    test_label = torch.cat(test_label)

    return (TensorDataset(train_data, train_label),
            TensorDataset(test_data, test_label))

You can use this function like this:

from torchvision import datasets, transforms

def main():
    train_ds = datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor()
                       ]))
    train_ds, test_ds = sampleFromClass(train_ds, 3)

Answer 3 (score: 0)

Custom dataset has a special meaning in PyTorch, but I think you meant any dataset. Let's check out the MNIST dataset (this is probably the most famous dataset for beginners).

import torch, torchvision
import torchvision.datasets as datasets
from torch.utils.data import DataLoader, Dataset, TensorDataset
train_loader = DataLoader(
  torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
                             transform=torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),
                               torchvision.transforms.Normalize(
                                 (0.5,), (0.5,))
                             ])),
  batch_size=16, shuffle=False)

print(train_loader.dataset.data.shape)

test_ds =  train_loader.dataset.data[:50000, :, :]
valid_ds =  train_loader.dataset.data[50000:, :, :]
print(test_ds.shape)
print(valid_ds.shape)

test_dst = train_loader.dataset.targets[:50000]
valid_dst = train_loader.dataset.targets[50000:]
print(test_dst, test_dst.shape)
print(valid_dst, valid_dst.shape)

This will print the original size of [60000, 28, 28], then the [50000, 28, 28] split for test and the [10000, 28, 28] split for validation:

torch.Size([60000, 28, 28])
torch.Size([50000, 28, 28])
torch.Size([10000, 28, 28])
tensor([5, 0, 4,  ..., 8, 4, 8]) torch.Size([50000])
tensor([3, 8, 6,  ..., 5, 6, 8]) torch.Size([10000])

In case you would actually like the images and the labels (targets) paired together:

bs = 16
test_dl = DataLoader(TensorDataset(test_ds, test_dst), batch_size=bs, shuffle=True)

for xb, yb in test_dl:
    pass  # do your work here
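Note that slicing `train_loader.dataset.data` directly bypasses the transform pipeline. A variant that keeps the dataset object (and therefore its transforms) intact is to split by index with `Subset`. A minimal sketch, using a scaled-down synthetic `TensorDataset` as a stand-in for MNIST, with a 5000/1000 boundary in place of the 50000/10000 one above:

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Synthetic stand-in for the MNIST dataset object (names are illustrative)
full_ds = TensorDataset(torch.randn(6000, 28, 28), torch.randint(0, 10, (6000,)))

# Index-based split: the underlying dataset (and any transforms) stays intact
test_split = Subset(full_ds, range(5000))
valid_split = Subset(full_ds, range(5000, 6000))
print(len(test_split), len(valid_split))  # 5000 1000

test_dl = DataLoader(test_split, batch_size=16, shuffle=True)
```

Indexing a `Subset` dereferences into the wrapped dataset, so each sample still comes back as an (image, label) pair.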

Answer 4 (score: 0)

If you would like to ensure your splits have balanced classes, you can use train_test_split from sklearn.

Assuming you have wrapped your data in a custom Dataset object:

from torch.utils.data import DataLoader, Subset
from sklearn.model_selection import train_test_split

TEST_SIZE = 0.1
BATCH_SIZE = 64
SEED = 42

# generate indices: instead of the actual data we pass in integers
train_indices, test_indices, _, _ = train_test_split(
    range(len(data)),
    data.targets,
    stratify=data.targets,
    test_size=TEST_SIZE,
    random_state=SEED
)

# generate subset based on indices
train_split = Subset(data, train_indices)
test_split = Subset(data, test_indices)

# create batches
train_batches = DataLoader(train_split, batch_size=BATCH_SIZE, shuffle=True)
test_batches = DataLoader(test_split, batch_size=BATCH_SIZE, shuffle=True)
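To sanity-check that the stratification worked, you can count labels per split. A small self-contained sketch, with a hypothetical imbalanced toy dataset standing in for `data`:

```python
from collections import Counter

import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset, TensorDataset

# Hypothetical imbalanced dataset: 90 samples of class 0, 10 of class 1
targets = torch.tensor([0] * 90 + [1] * 10)
data = TensorDataset(torch.randn(100, 4), targets)

# Pass indices rather than the data itself, stratifying on the targets
train_indices, test_indices = train_test_split(
    range(len(data)), stratify=targets.numpy(), test_size=0.2, random_state=0)

train_split = Subset(data, train_indices)
test_split = Subset(data, test_indices)

# Both splits preserve the 9:1 class ratio
print(Counter(targets[i].item() for i in train_indices))  # Counter({0: 72, 1: 8})
print(Counter(targets[i].item() for i in test_indices))   # Counter({0: 18, 1: 2})
```

Without `stratify`, a split this small could easily end up with zero or one minority-class samples in the test set.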