Before splitting the dataset, I need to load the data in random order and then split it. This is the snippet for splitting the dataset, and it is not random. I'd like to know how to do this for the images and the corresponding masks in folder_mask?
folder_data = glob.glob("D:\\Neda\\Pytorch\\U-net\\my_data\\imagesResized\\*.png")
folder_mask = glob.glob("D:\\Neda\\Pytorch\\U-net\\my_data\\labelsResized\\*.png")
# split these path using a certain percentage
len_data = len(folder_data)
print("count of dataset: ", len_data)
# count of dataset: 992
split_1 = int(0.6 * len(folder_data))
split_2 = int(0.8 * len(folder_data))
#folder_data.sort()
train_image_paths = folder_data[:split_1]
print("count of train images is: ", len(train_image_paths))
valid_image_paths = folder_data[split_1:split_2]
print("count of validation image is: ", len(valid_image_paths))
test_image_paths = folder_data[split_2:]
print("count of test images is: ", len(test_image_paths))
train_mask_paths = folder_mask[:split_1]
valid_mask_paths = folder_mask[split_1:split_2]
test_mask_paths = folder_mask[split_2:]
train_dataset = CustomDataset(train_image_paths, train_mask_paths)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=1,
shuffle=True, num_workers=2)
valid_dataset = CustomDataset(valid_image_paths, valid_mask_paths)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=1,
shuffle=True, num_workers=2)
test_dataset = CustomDataset(test_image_paths, test_mask_paths)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1,
shuffle=False, num_workers=2)
dataLoaders = {
'train': train_loader,
'valid': valid_loader,
'test': test_loader,
}
Answer 0: (score: 1)
As I understand it, you want to randomize the order of the pictures, so that each rerun puts different photos in the train and test sets. Assuming you want to do this in more-or-less plain Python, you can do the following.

The easiest way to shuffle a list of elements in Python is:
import random
random.shuffle(my_list)  # shuffles my_list in place
So you have two lists, and you want to keep the link between the data and the masks. If a quick hack is acceptable, I'd suggest something like this:
import random
folder_data = glob.glob("D:\\Neda\\Pytorch\\U-net\\my_data\\imagesResized\\*.png")
folder_mask = glob.glob("D:\\Neda\\Pytorch\\U-net\\my_data\\labelsResized\\*.png")
assert len(folder_data) == len(folder_mask)  # anything else would be bad
indices = list(range(len(folder_data)))
random.shuffle(indices)
Now you have a list of indices you can split, and you can then use the indices from the split lists to access the other lists.
split_1 = int(0.6 * len(folder_data))
split_2 = int(0.8 * len(folder_data))
train_image_paths = [folder_data[i] for i in indices[:split_1]]
train_mask_paths = [folder_mask[i] for i in indices[:split_1]]
# and so on for the validation and test splits...
That would be the plain-Python way. But packages such as sklearn have functions for exactly this, so you may want to consider using them; they will save you a lot of work. (Generally, reusing existing code is better than implementing it yourself.)
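To make the index-based approach concrete, here is a minimal, self-contained sketch. The file names are made up for illustration; the real code would use the `folder_data` / `folder_mask` lists from the glob calls above. The key point is that one shuffled index list drives both path lists, so each image stays aligned with its mask:

```python
import random

# Stand-ins for the glob results (hypothetical names, just for the demo)
folder_data = [f"image_{i}.png" for i in range(10)]
folder_mask = [f"label_{i}.png" for i in range(10)]

# Shuffle indices once and apply the same order to both lists
indices = list(range(len(folder_data)))
random.shuffle(indices)

split_1 = int(0.6 * len(folder_data))
split_2 = int(0.8 * len(folder_data))

train_idx = indices[:split_1]
valid_idx = indices[split_1:split_2]
test_idx = indices[split_2:]

train_image_paths = [folder_data[i] for i in train_idx]
train_mask_paths = [folder_mask[i] for i in train_idx]

# image_k is always paired with label_k, regardless of shuffle order
assert all(img.replace("image", "label") == msk
           for img, msk in zip(train_image_paths, train_mask_paths))
```

The same list comprehensions with `valid_idx` and `test_idx` produce the validation and test splits.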
Answer 1: (score: 0)
Try sklearn.model_selection.train_test_split.
from sklearn.model_selection import train_test_split
train_image_paths, test_image_paths, train_mask_paths, test_mask_paths = \
    train_test_split(folder_data, folder_mask, test_size=0.2)
This will split your data and labels into corresponding train and test sets. If you need a validation set, you can use it twice: first split into train/test, then split the training subset again into train/validation.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
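The two-stage split described above can be sketched as follows (the file names are dummies standing in for the glob results). With `test_size=0.2` and then `0.25` on the remainder, you get the same 60/20/20 proportions as in the question, since 0.25 × 0.8 = 0.2:

```python
from sklearn.model_selection import train_test_split

# Dummy stand-ins for folder_data / folder_mask
folder_data = [f"image_{i}.png" for i in range(100)]
folder_mask = [f"label_{i}.png" for i in range(100)]

# First split off 20% for testing; passing both lists keeps them aligned
train_imgs, test_imgs, train_masks, test_masks = train_test_split(
    folder_data, folder_mask, test_size=0.2, random_state=42)

# Then take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2)
train_imgs, valid_imgs, train_masks, valid_masks = train_test_split(
    train_imgs, train_masks, test_size=0.25, random_state=42)

print(len(train_imgs), len(valid_imgs), len(test_imgs))  # 60 20 20
```

Because both lists are passed to each call, train_test_split shuffles them with the same permutation, preserving the image/mask pairing.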
Answer 2: (score: 0)
If I got it right, there is one mask per sample, correct? Use pandas to pair the data and masks, then split them randomly with a helper function:
import glob
import pandas as pd
import numpy as np
def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
np.random.seed(seed)
perm = np.random.permutation(df.index)
m = len(df.index)
train_end = int(train_percent * m)
validate_end = int(validate_percent * m) + train_end
train = df.loc[perm[:train_end]]
validate = df.loc[perm[train_end:validate_end]]
test = df.loc[perm[validate_end:]]
return train, validate, test
folder_data = glob.glob("D:\\Neda\\Pytorch\\U-net\\my_data\\imagesResized\\*.png")
folder_mask = glob.glob("D:\\Neda\\Pytorch\\U-net\\my_data\\labelsResized\\*.png")
data_mask = pd.DataFrame({"data": folder_data, "mask": folder_mask})
train, validate, test = train_validate_test_split(data_mask)
Credit for the helper function goes to @piRSquared's answer in this question.
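A quick sanity check of the helper on synthetic data (hypothetical file names standing in for the glob results) confirms the default 60/20/20 split and the preserved pairing:

```python
import numpy as np
import pandas as pd

def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    # Same helper as above, with the deprecated .ix replaced by .loc
    np.random.seed(seed)
    perm = np.random.permutation(df.index)
    m = len(df.index)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.loc[perm[:train_end]]
    validate = df.loc[perm[train_end:validate_end]]
    test = df.loc[perm[validate_end:]]
    return train, validate, test

# Hypothetical file names; the real code would pass the glob lists
data_mask = pd.DataFrame({
    "data": [f"image_{i}.png" for i in range(100)],
    "mask": [f"label_{i}.png" for i in range(100)],
})

train, validate, test = train_validate_test_split(data_mask, seed=0)
print(len(train), len(validate), len(test))  # 60 20 20
```

Because each row holds one image/mask pair, the permutation can never separate an image from its mask, which is the main advantage of the DataFrame approach.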