Question

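I am training an LSTM to classify time series data into 2 classes (0 and 1). I have a huge dataset on Drive, with the class 0 and class 1 data stored in separate locations. I am trying to train the LSTM batch-wise by creating a Dataset class and wrapping a DataLoader around it. I also have to do some preprocessing, such as reshaping. Here is my code: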

`

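import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class LoadingDataset(Dataset):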
  def __init__(self,data_root1,data_root2,file_name):
    self.data_root1=data_root1#Has the path for class1 data
    self.data_root2=data_root2#Has the path for class0 data
    self.fileap1= pd.DataFrame()#Stores class 1 data
    self.fileap0 = pd.DataFrame()#Stores class 0 data
    self.file_name=file_name#List of all the files at data_root1 and data_root2
    self.labs1=None #Will store the class 1 labels
    self.labs0=None #Will store the class 0 labels

  def __len__(self):
    return len(self.file_name)#one item per file; self.fileap1 is still empty before the first __getitem__ call

  def __getitem__(self, index):        
    self.fileap1 = pd.read_csv(self.data_root1+self.file_name[index],header=None)#read the csv file for class 1
    self.fileap1=self.fileap1.iloc[1:,1:].values.reshape(-1,WINDOW+1,1)#reshape the file for lstm
    self.fileap0 = pd.read_csv(self.data_root2+self.file_name[index],header=None)#read the csv file for class 0
    self.fileap0=self.fileap0.iloc[1:,1:].values.reshape(-1,WINDOW+1,1)#reshape the file for lstm
    self.labs1=np.array([1]*len(self.fileap1)).reshape(-1,1)#create the labels 1 for the csv file
    self.labs0=np.array([0]*len(self.fileap0)).reshape(-1,1)#create the labels 0 for the csv file
    # print(self.fileap1.shape,' ',self.fileap0.shape)
    # print(self.labs1.shape,' ',self.labs0.shape)
    self.fileap1=np.append(self.fileap1,self.fileap0,axis=0)#combine the class 0 and class one data
    self.fileap1 = torch.from_numpy(self.fileap1).float()
    self.labs1=np.append(self.labs1,self.labs0,axis=0)#combine the label0 and label 1 data
    self.labs1 = torch.from_numpy(self.labs1).int()
    # print(self.fileap1.shape,' ',self.fileap0.shape)
    # print(self.labs1.shape,' ',self.labs0.shape)

    return self.fileap1,self.labs1

data_root1 = '/content/gdrive/My Drive/Data/Processed_Data/Folder1/One_'#location of class 1 data
data_root2 = '/content/gdrive/My Drive/Data/Processed_Data/Folder0/Zero_'#location of class 0 data
training_set=LoadingDataset(data_root1,data_root2,train_ind)#train_ind is a list of file names that have to be read from data_root1 and data_root2
training_generator = DataLoader(training_set,batch_size =2,num_workers=4)

for epoch in range(num_epochs):
  model.train()#Setting the model to train mode after eval mode to train for next epoch once the testing for that epoch is finished
  for i, (inputs, targets) in enumerate(training_generator):#iterate over the DataLoader defined above
    .
    .
    .
    .

` I get this error when I run the code:

RuntimeError: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 68, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 68, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 43, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 96596 and 25060 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:711

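My questions are:

1. Have I implemented this correctly? Is this how you preprocess a dataset and then train on it batch-wise?

2. The batch_size of the DataLoader and the batch_size of the LSTM are different things: the DataLoader's batch_size refers to the number of files, whereas the LSTM model's batch_size refers to the number of instances. Will I get another error here?

3. I have no idea how to scale this dataset, since MinMaxScaler has to be applied to the dataset as a whole.

Responses are appreciated. Please let me know if I need to create a separate post for each question.

Thank you.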

Answer 1

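Here is a summary of how PyTorch does things:

You have a dataset, which is an object that has a __len__ method and a __getitem__ method.
You create a dataloader from that dataset and a collate_fn.
You iterate through the dataloader and pass each batch of data to your model.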
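
To make the three steps above concrete, here is a minimal sketch. ToyDataset and its sizes are made up for illustration (they are not the asker's data), and the DataLoader uses its default collate_fn, which stacks equally sized samples into one batch tensor.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical dataset: 10 samples, each a (5, 1) window with a 0/1 label."""

    def __init__(self, n_samples=10, window=5):
        self.x = torch.arange(n_samples * window, dtype=torch.float32).reshape(
            n_samples, window, 1
        )
        self.y = torch.arange(n_samples) % 2

    def __len__(self):
        # Number of samples the DataLoader can index into.
        return len(self.x)

    def __getitem__(self, index):
        # Return a single sample; batching is the DataLoader's job.
        return self.x[index], self.y[index]


dataset = ToyDataset()
dataloader = DataLoader(dataset, batch_size=4)  # default collate_fn stacks samples

# Shapes of every (inputs, targets) batch produced by one pass over the data.
batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in dataloader]
```

Stacking only works here because every sample has the same shape. That is the condition the asker's per-file tensors violate.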

So basically your training loop will look like

for x, y in dataloader:
    output = model(x)
    ...

or

for x, y in dataloader:
    output = model(*x)
    ...

if your model's forward method takes more than one argument.
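
As an illustration of the second form, here is a sketch of a model whose forward takes two arguments. TwoInputLSTM, its layer sizes, and the hand-built batch are all assumptions for the example; in practice x would be a tuple produced by your collate_fn.

```python
import torch
from torch import nn


class TwoInputLSTM(nn.Module):
    """Hypothetical classifier whose forward takes a sequence and an initial state."""

    def __init__(self, hidden=8):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, seq, state=None):
        out, _ = self.lstm(seq, state)
        # Classify from the output at the last time step.
        return self.head(out[:, -1])


model = TwoInputLSTM()

# x is a tuple of forward arguments: the sequences and an (h0, c0) state.
seq = torch.randn(3, 5, 1)  # (batch, time steps, features)
h0 = torch.zeros(1, 3, 8)   # (num_layers, batch, hidden)
c0 = torch.zeros(1, 3, 8)
x = (seq, (h0, c0))

logits = model(*x)  # unpacks to model(seq, (h0, c0))
```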

So how does this work? Basically, you have a generator of batch indices, batch_sampler, and this is what the loop inside the dataloader does:

for indices in batch_sampler:
    yield collate_fn([dataset[i] for i in indices])
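
The loop above can be reproduced by hand to check that it matches what a DataLoader yields. This is a sketch of the single-process, unshuffled case only. The tiny TensorDataset is made up, and manual_batches is a hypothetical helper, not a PyTorch function.

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler, TensorDataset
from torch.utils.data.dataloader import default_collate

# Made-up data: 6 samples with 2 features each, plus 0/1 labels.
dataset = TensorDataset(
    torch.arange(12.0).reshape(6, 2),
    torch.tensor([0, 1, 0, 1, 0, 1]),
)

# Yields lists of indices: [0, 1, 2, 3] and then [4, 5].
batch_sampler = BatchSampler(SequentialSampler(dataset), batch_size=4, drop_last=False)


def manual_batches(dataset, batch_sampler, collate_fn=default_collate):
    """Hypothetical helper that mirrors the loop described above."""
    for indices in batch_sampler:
        yield collate_fn([dataset[i] for i in indices])


manual = list(manual_batches(dataset, batch_sampler))
loaded = list(DataLoader(dataset, batch_size=4, shuffle=False))
```

Both lists hold the same two batches, so any collate_fn you write only has to handle a plain list of dataset[i] results.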

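So if you want everything to work properly, look at your model's forward method and check how many arguments it takes (in my experience, an LSTM's forward method can take several arguments), then make sure your collate_fn returns each batch in the form that forward expects.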
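
For the error in the question, one possible approach (an assumption on my part, not the only fix) is a collate_fn that concatenates the per-file tensors along dimension 0 instead of stacking them, since each file contributes a different number of windows. PerFileDataset below is a made-up stand-in for the CSV-backed dataset, and concat_collate is a hypothetical name.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class PerFileDataset(Dataset):
    """Hypothetical stand-in: item i plays the role of one CSV file and
    holds a different number of windows, as in the question."""

    def __init__(self, windows_per_file, window=4):
        self.windows_per_file = windows_per_file
        self.window = window

    def __len__(self):
        return len(self.windows_per_file)

    def __getitem__(self, index):
        n = self.windows_per_file[index]
        x = torch.randn(n, self.window, 1)                   # (windows, time steps, 1)
        y = torch.full((n, 1), index % 2, dtype=torch.long)  # dummy labels
        return x, y


def concat_collate(batch):
    """Concatenate per-file tensors along dim 0; torch.stack would fail here."""
    xs, ys = zip(*batch)
    return torch.cat(xs, dim=0), torch.cat(ys, dim=0)


dataset = PerFileDataset([3, 5, 2])
loader = DataLoader(dataset, batch_size=2, collate_fn=concat_collate)

shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
```

With this collate_fn the DataLoader's batch_size still counts files, while the first dimension of each batch (the batch dimension an LSTM built with batch_first=True would see) is the total number of windows in those files, so it varies from batch to batch.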

Loading a huge dataset batch-wise to train PyTorch