I am trying to train a deep learning model without loading the entire dataset into memory. My main question is: what is the best way to do this?
HDF5 seems to be the common way people accomplish this, so it is what I tried first. However, when I use PyTorch's DataLoader class it runs very slowly. I wrote my own iterator, which runs much faster, but the data in each batch is not random. I am trying to understand why the PyTorch DataLoader is slow, and whether there is anything I can do about it.
My code is below.
First, I define a dataset class that takes the file path of an HDF5 dataset. My understanding of this code is that it reads from disk every time __getitem__ is called.
import h5py
import numpy as np
import torch

class My_H5Dataset(torch.utils.data.Dataset):
    def __init__(self, file_path):
        super(My_H5Dataset, self).__init__()
        # Keep handles to the on-disk HDF5 datasets; nothing beyond
        # metadata is loaded into memory here.
        h5_file = h5py.File(file_path, 'r')
        self.features = h5_file['features']
        self.labels = h5_file['labels']
        self.index = h5_file['index']
        self.labels_values = h5_file['labels_values']
        self.index_values = h5_file['index_values']

    def __getitem__(self, index):
        # Each call triggers a read from disk for the requested row(s).
        return (torch.from_numpy(self.features[index, :]).float(),
                torch.from_numpy(self.labels[index, :]).float(),
                torch.from_numpy(self.index[index, :]).float(),
                torch.from_numpy(np.array(self.labels_values[index])),
                torch.from_numpy(np.array(self.index_values[index])))

    def __len__(self):
        return self.features.shape[0]
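One thing I have seen suggested (I have not benchmarked it myself) is that h5py file handles opened in __init__ do not always play well with DataLoader worker processes, and that opening the file lazily inside each worker can help. A minimal sketch of that idea, assuming the same dataset layout (the class name is my own):

class My_H5DatasetLazy(torch.utils.data.Dataset):
    """Variant that opens the HDF5 file lazily, so each DataLoader
    worker gets its own file handle on first access."""
    def __init__(self, file_path):
        super().__init__()
        self.file_path = file_path
        self.h5_file = None
        # Open once just to record the length, then close again.
        with h5py.File(file_path, 'r') as f:
            self._len = f['features'].shape[0]

    def __getitem__(self, index):
        if self.h5_file is None:
            self.h5_file = h5py.File(self.file_path, 'r')
        f = self.h5_file
        return (torch.from_numpy(f['features'][index, :]).float(),
                torch.from_numpy(f['labels'][index, :]).float(),
                torch.from_numpy(f['index'][index, :]).float(),
                torch.from_numpy(np.array(f['labels_values'][index])),
                torch.from_numpy(np.array(f['index_values'][index])))

    def __len__(self):
        return self._len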
Then I simply pass it to the PyTorch DataLoader as follows:
train_dataset = My_H5Dataset(hdf5_data_folder_train)
train_ms = MySampler(train_dataset)
trainloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size,
                                          sampler=train_ms, num_workers=2)
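For comparison, one pattern I have seen recommended for HDF5 is to have the DataLoader hand the dataset a whole list of indices per call, so each batch becomes a single h5py read instead of batch_size separate reads. A sketch of that idea (the subclass name is mine, and I have not benchmarked it):

from torch.utils.data import DataLoader, BatchSampler, RandomSampler

class My_H5BatchDataset(My_H5Dataset):
    def __getitem__(self, indices):
        # h5py fancy indexing requires increasing order, so sort the
        # batch indices; the batch is still randomly composed.
        idx = np.sort(np.asarray(indices))
        return (torch.from_numpy(self.features[idx, :]).float(),
                torch.from_numpy(self.labels[idx, :]).float(),
                torch.from_numpy(self.index[idx, :]).float(),
                torch.from_numpy(self.labels_values[idx]),
                torch.from_numpy(self.index_values[idx]))

train_dataset = My_H5BatchDataset(hdf5_data_folder_train)
# batch_size=None disables automatic batching, so the BatchSampler's
# index lists are passed straight through to __getitem__.
batch_sampler = BatchSampler(RandomSampler(train_dataset),
                             batch_size=batch_size, drop_last=False)
trainloader = DataLoader(train_dataset, sampler=batch_sampler,
                         batch_size=None, num_workers=0)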
My other approach was to define the iterator manually, and it does indeed run faster. However, I see no speed difference between data stored on an HDD and on an SSD, which makes me worry that I am missing a bottleneck somewhere.
def hdf5_loader_generator(dataset, batch_size, as_tensor=True, n_samples=10000):
    """Given a dataset backed by an HDF5 file, returns a generator that
    yields one contiguous batch at a time."""
    stop = n_samples
    curr_index = start = 0
    while True:
        stop_index = min(curr_index + batch_size, stop)
        ft, sp, index, sp_values, index_values = dataset[curr_index:stop_index]
        curr_index += batch_size
        # Yield before wrapping around, so the final (possibly partial)
        # batch of an epoch is not silently dropped.
        yield ft, sp, index, sp_values, index_values
        if curr_index >= stop:
            curr_index = start

def hdf5_data_iterator(dataset, batch_size, as_tensor=True):
    return iter(hdf5_loader_generator(dataset, batch_size, as_tensor))
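To address the randomness problem with this iterator, one option I can think of is to keep the reads contiguous but shuffle the order in which chunks are visited, which gives chunk-level (not row-level) shuffling. A sketch:

def shuffled_chunk_generator(dataset, batch_size, n_samples=10000):
    """Yield contiguous chunks in a random order each epoch: rows within
    a chunk stay sequential (fast HDF5 reads), but the chunk order is
    reshuffled every epoch, giving approximate randomness."""
    starts = np.arange(0, n_samples, batch_size)
    while True:
        np.random.shuffle(starts)
        for s in starts:
            yield dataset[s:min(s + batch_size, n_samples)]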
I timed the PyTorch iterator as follows:
import time

start = time.time()
for i, data in enumerate(trainloader):
    ft, sp, index, sp_values, index_values = data
    ft, sp, index = ft.to(device), sp.to(device), index.to(device)
    if i > 500:
        break
end = time.time()
print(end - start)
And I timed my own iterator as follows:
train_dataset = My_H5Dataset(hdf5_data_folder_train)
samples = hdf5_data_iterator(train_dataset, batch_size)
start = time.time()
for i, data in enumerate(samples):
    ft, sp, index, sp_values, index_values = data
    ft, sp, index = ft.to(device), sp.to(device), index.to(device)
    if i > 500:
        break
end = time.time()
print(end - start)
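Since I am worried about a hidden bottleneck, one way to check whether disk I/O actually dominates is to time raw h5py reads with no PyTorch in the loop, along these lines (the sizes are arbitrary, and this assumes the file has at least 500 * batch_size rows):

with h5py.File(hdf5_data_folder_train, 'r') as f:
    features = f['features']
    n_batches = 500

    # Contiguous slice reads, like my custom iterator does.
    start = time.time()
    for s in range(0, n_batches * batch_size, batch_size):
        _ = features[s:s + batch_size, :]
    print('sequential slices:', time.time() - start)

    # Single rows in random order, like a DataLoader with a
    # per-sample sampler does.
    rows = np.random.randint(0, features.shape[0], size=n_batches * batch_size)
    start = time.time()
    for r in rows:
        _ = features[r, :]
    print('random single rows:', time.time() - start)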
Is there a better approach than the ones I have tried so far?
Thanks for your help!