My data looks like this:
folder 1
    part0001.csv
    part0002.csv
    ...
    part0199.csv
folder 2
    part0001.csv
    part0002.csv
    ...
    part0199.csv
folder 3
    part0001.csv
    part0002.csv
    ...
    part0199.csv
Update:
Each .csv file is about 100 MB. The features and the label are in the same .csv file. Each .csv file looks like this:
   feat1  feat2  label
1      1      3      0
2      3      4      1
3      2      5      0
...
I want to load batches of samples from the .csv files.
Answer (score: 1)
You have to build a Dataset that loads them. (Docs: https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset)
Example:
import torch
import glob2
import pandas as pd
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, root):
        self.root = root
        # make a list containing the path to all your csv files
        self.paths = glob2.glob(f'{self.root}/**/*.csv')

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # load one csv file and split it into features and label columns
        data = pd.read_csv(self.paths[idx])
        x = data[['feat1', 'feat2']].values
        y = data['label'].values
        return x, y
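With this Dataset, every __getitem__ returns the contents of one whole csv file, so a DataLoader built on top of it batches files rather than rows. A minimal usage sketch, where the 'src' root folder and the loader settings are just assumptions for illustration:

from torch.utils.data import DataLoader

# 'src' is a placeholder for the directory that contains folder 1/2/3
dataset = CustomDataset('src')
# batch_size=1 because each item is a whole csv file and files may differ in length
loader = DataLoader(dataset, batch_size=1, shuffle=True, num_workers=2)

for x, y in loader:
    x = x.squeeze(0)  # (rows, 2) feature tensor for one file
    y = y.squeeze(0)  # (rows,) label tensor for one file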
This is the basic version; you can modify it to draw random samples from each csv file, or to preprocess the data before training.
Edit
If you only want a single row from a csv at a time, there are three things you can do. The first two solutions are not hard to implement; the code for solution 3 should look something like this:
import torch
import glob2
import numpy as np
import pandas as pd
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, root):
        self.root = root
        # make a list containing the path to all your csv files
        self.paths = glob2.glob(f'{self.root}/**/*.csv')
        # dict to keep loaded data in memory:
        self.cache = {}

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        """This __getitem__ loads csv files and keeps them in memory during training."""
        data = self.cache.get(idx, None)
        if data is None:
            data = pd.read_csv(self.paths[idx])
            try:
                # cache data into memory
                self.cache[idx] = data
            except MemoryError:
                # we may be using too much memory: drop the oldest cached file
                del self.cache[list(self.cache.keys())[0]]
        # sample one random row from the loaded file
        rnd_idx = np.random.randint(len(data))
        x = data[['feat1', 'feat2']].values[rnd_idx]
        y = data['label'].values[rnd_idx]
        return x, y
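Since this version returns one random row per call, a DataLoader can batch rows directly. A sketch of how it might be used, where the batch size and the 'src' root path are assumptions:

from torch.utils.data import DataLoader

dataset = CustomDataset('src')   # placeholder root folder
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for x, y in loader:
    # x: (32, 2) tensor of features, y: (32,) tensor of labels,
    # each pair drawn as one random row from a randomly chosen csv file
    ...

Keep in mind that with num_workers > 0 every worker process holds its own copy of self.cache, so memory use grows with the number of workers.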