I am training an LSTM to classify time-series data into 2 classes (0 and 1). I have a large dataset on a drive, with the class 0 and class 1 data in separate folders. I am trying to train the LSTM in batches by creating a Dataset class and wrapping a DataLoader around it. I have to do preprocessing such as reshaping. Here is my code:
```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

# WINDOW, train_ind, model, and num_epochs are defined elsewhere in my notebook.

class LoadingDataset(Dataset):
    def __init__(self, data_root1, data_root2, file_name):
        self.data_root1 = data_root1   # path to the class 1 data
        self.data_root2 = data_root2   # path to the class 0 data
        self.fileap1 = pd.DataFrame()  # stores class 1 data
        self.fileap0 = pd.DataFrame()  # stores class 0 data
        self.file_name = file_name     # list of all the files at data_root1 and data_root2
        self.labs1 = None              # will store the class 1 labels
        self.labs0 = None              # will store the class 0 labels

    def __len__(self):
        return len(self.file_name)     # one item per file pair

    def __getitem__(self, index):
        self.fileap1 = pd.read_csv(self.data_root1 + self.file_name[index], header=None)  # read the csv file for class 1
        self.fileap1 = self.fileap1.iloc[1:, 1:].values.reshape(-1, WINDOW + 1, 1)        # reshape the file for the LSTM
        self.fileap0 = pd.read_csv(self.data_root2 + self.file_name[index], header=None)  # read the csv file for class 0
        self.fileap0 = self.fileap0.iloc[1:, 1:].values.reshape(-1, WINDOW + 1, 1)        # reshape the file for the LSTM
        self.labs1 = np.array([1] * len(self.fileap1)).reshape(-1, 1)  # create the 1 labels for the class 1 file
        self.labs0 = np.array([0] * len(self.fileap0)).reshape(-1, 1)  # create the 0 labels for the class 0 file
        self.fileap1 = np.append(self.fileap1, self.fileap0, axis=0)   # combine the class 0 and class 1 data
        self.fileap1 = torch.from_numpy(self.fileap1).float()
        self.labs1 = np.append(self.labs1, self.labs0, axis=0)         # combine the 0 and 1 labels
        self.labs1 = torch.from_numpy(self.labs1).int()
        return self.fileap1, self.labs1

data_root1 = '/content/gdrive/My Drive/Data/Processed_Data/Folder1/One_'   # location of class 1 data
data_root2 = '/content/gdrive/My Drive/Data/Processed_Data/Folder0/Zero_'  # location of class 0 data

# train_ind is a list of the file names to read from data_root1 and data_root2
training_set = LoadingDataset(data_root1, data_root2, train_ind)
training_generator = DataLoader(training_set, batch_size=2, num_workers=4)

for epoch in range(num_epochs):
    model.train()  # set the model back to train mode for the next epoch once that epoch's testing is finished
    for i, (inputs, targets) in enumerate(training_generator):
        # ...
```

When I run this code I get this error:
```
RuntimeError: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 68, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 68, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 43, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 96596 and 25060 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:711
```
My questions are:

1. Have I implemented this correctly? Is this how you preprocess a dataset and then train on it in batches?
2. The batch_size of the DataLoader and the batch_size of the LSTM are not the same: the DataLoader's batch_size refers to the number of files, whereas the LSTM model's batch_size refers to the number of instances. Will I get another error because of this?
3. I have no idea how to scale this dataset, since a MinMaxScaler has to be applied to the dataset as a whole.

Appreciate any replies. Please let me know if I should create separate posts for each question.

Thanks.
Answer 0 (score: 0):
Here is a summary of how PyTorch works:

- You have a `dataset`, which is an object with a `__len__` method and a `__getitem__` method.
- You create a `dataloader` from that `dataset` and a `collate_fn`.
- You iterate over the `dataloader` and pass a batch of data to your model (see the sketch right after this list).
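As a minimal sketch of those three pieces (the `MyDataset` class, its sizes, and its random tensors are hypothetical stand-ins, not the asker's data):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):  # hypothetical toy dataset
    def __init__(self, n_samples=10):
        # one fixed-size sample per index, so the default collate_fn can stack them
        self.data = torch.randn(n_samples, 5, 1)          # (sample, seq_len, features)
        self.labels = torch.randint(0, 2, (n_samples, 1))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

loader = DataLoader(MyDataset(), batch_size=4)  # the default collate_fn stacks the samples
for x, y in loader:
    print(x.shape, y.shape)  # torch.Size([4, 5, 1]) torch.Size([4, 1])
    break
```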
并将一批数据传递给模型。因此,基本上,您的训练循环将类似于
for x, y in dataloader:
output = model(x)
...
或
for x, y in dataloader:
output = model(*x)
...
如果您的模型forward
方法采用多个参数。
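For example, `torch.nn.LSTM`'s own forward takes an optional second argument, the initial hidden and cell states (the sizes below are made up for illustration):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=1, hidden_size=8, batch_first=True)
x = torch.randn(4, 5, 1)   # (batch, seq_len, features)
h0 = torch.zeros(1, 4, 8)  # (num_layers, batch, hidden_size)
c0 = torch.zeros(1, 4, 8)
output, (hn, cn) = lstm(x, (h0, c0))  # forward called with two arguments
```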
So how does this work? Basically, you have a generator of batch indices, the `batch_sampler`, and this is what the loop inside your dataloader does:

```python
for indices in batch_sampler:
    yield collate_fn([dataset[i] for i in indices])
```

So if you want everything to work nicely, you have to look at your model's `forward` method and see how many arguments it takes (in my experience an LSTM's forward method can take multiple arguments), and make sure you pass them correctly using a `collate_fn`.
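To connect this to the traceback above: the default `collate_fn` tries to `torch.stack` the per-file tensors returned by `__getitem__`, and those have different lengths (96596 vs. 25060 windows), so the stack fails. One possible fix, sketched here under the assumption that every window shares the same `(WINDOW + 1, 1)` shape, is a custom `collate_fn` that concatenates the files' windows along dimension 0 instead of stacking them:

```python
import torch
from torch.utils.data import DataLoader

def concat_collate(batch):
    # batch is a list of (windows, labels) pairs, one pair per file;
    # concatenate along dim 0 instead of stacking into a new batch dim
    xs = torch.cat([x for x, _ in batch], dim=0)  # (total_windows, WINDOW + 1, 1)
    ys = torch.cat([y for _, y in batch], dim=0)  # (total_windows, 1)
    return xs, ys

# reusing training_set from the question
training_generator = DataLoader(training_set, batch_size=2,
                                num_workers=4, collate_fn=concat_collate)
```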