Question

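I have an ImageFolder in PyTorch that holds the images for my classification task. Each folder is named after a class and contains the images of that class.

I loaded the data and split it into training and test sets using samplers built from a random train_test_split. The problem is that my data is imbalanced: some classes have many images and others have far fewer.

To fix this, I want to take 20% of each class as test data and use the rest as training data.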

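import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import SubsetRandomSampler
from torchvision.datasets import ImageFolder

ds = ImageFolder(filePath, transform=transform)
batch_size = 64
validation_split = 0.2

indices = list(range(len(ds)))  # indices of the dataset

# TODO: fix splitting (this split is random, not per class)
train_indices, test_indices = train_test_split(indices, test_size=validation_split)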

# Creating PT data samplers and loaders:
train_sampler = SubsetRandomSampler(train_indices)
test_sampler = SubsetRandomSampler(test_indices)

train_loader = torch.utils.data.DataLoader(ds, batch_size=batch_size, sampler=train_sampler, num_workers=16)
test_loader = torch.utils.data.DataLoader(ds, batch_size=batch_size, sampler=test_sampler, num_workers=16)

How should I solve this?

Answer 1

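According to the docs, use the stratify argument of train_test_split. If your labels are in an array-like y (for an ImageFolder, ds.targets holds one class index per image), do the following: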

train_indices,test_indices = train_test_split(indices, test_size=0.2, stratify=y)
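As a quick check of what stratify does, here is a minimal sketch on synthetic labels. The list y below is made up and only stands in for ds.targets; the class sizes and random_state are arbitrary choices for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for ds.targets (assumption:
# with a real ImageFolder you would use y = ds.targets instead).
y = [0] * 100 + [1] * 50 + [2] * 10
indices = list(range(len(y)))

# stratify=y keeps the class proportions the same in both splits.
train_indices, test_indices = train_test_split(
    indices, test_size=0.2, stratify=y, random_state=0
)

# Count how many samples of each class landed in each split.
train_counts = Counter(y[i] for i in train_indices)
test_counts = Counter(y[i] for i in test_indices)
print(sorted(train_counts.items()))
print(sorted(test_counts.items()))
```

Each class should contribute about 20% of its samples to the test split, regardless of how large the class is. The resulting index lists can be passed to SubsetRandomSampler exactly as in the question.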

Answer 2

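Try using StratifiedKFold or StratifiedShuffleSplit.

According to the documentation:

This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.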
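To make "preserving the percentage of samples for each class" concrete, here is a small sketch with StratifiedKFold on made-up labels (80 samples of class 0 and 20 of class 1). The names labels, placeholder_X and fold_counts are illustrative only.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels with an 80/20 class ratio.
labels = np.array([0] * 80 + [1] * 20)
# split() only needs X for its length, so a placeholder array is enough.
placeholder_X = np.zeros(len(labels))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = []
for train_index, test_index in skf.split(placeholder_X, labels):
    # Record the class make-up of each test fold.
    counts = Counter(labels[test_index].tolist())
    fold_counts.append((counts[0], counts[1]))

print(fold_counts)
```

Every test fold should show the same 80/20 ratio as the full label set, which is what a plain KFold does not guarantee.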

You can try something like this:

import numpy as np
import torch
from sklearn.model_selection import StratifiedShuffleSplit

# One stratified split that puts 20% of each class in the test set.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
# split() needs the labels; X is only used for its length, so a placeholder works.
for train_index, test_index in sss.split(np.zeros(len(ds)), ds.targets):
    train = torch.utils.data.Subset(ds, train_index)
    test = torch.utils.data.Subset(ds, test_index)
    trainloader = torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory=False)
    testloader = torch.utils.data.DataLoader(test, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory=False)
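If you would rather not depend on scikit-learn for this, the per-class 20% split described in the question can also be done by hand. The sketch below uses only the standard library; split_per_class is a hypothetical helper name, and the targets list is synthetic (with an ImageFolder you would pass ds.targets).

```python
import random
from collections import defaultdict


def split_per_class(targets, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), taking about
    test_fraction of every class for the test set."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(targets):
        by_class[label].append(index)

    train_indices, test_indices = [], []
    for members in by_class.values():
        rng.shuffle(members)  # random choice within the class
        n_test = round(len(members) * test_fraction)
        test_indices.extend(members[:n_test])
        train_indices.extend(members[n_test:])
    return train_indices, test_indices


# Synthetic, imbalanced labels standing in for ds.targets.
targets = [0] * 100 + [1] * 50 + [2] * 10
train_indices, test_indices = split_per_class(targets)
print(len(train_indices), len(test_indices))
```

The two index lists can then be wrapped in SubsetRandomSampler or torch.utils.data.Subset as shown above. Note that very small classes may end up with no test samples because of rounding, so it is worth checking the per-class counts.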

How to split a dataset into test and training data based on the number of targets in each class
