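I need a correct way to implement a custom data generator for text data, or an existing library that provides text data generators as an alternative.
I have written a custom data generator that yields data in batches of the given batch size: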
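def a_data_generator(inputXPath, inputYPath, batch_size, mode="train"):
    # eng_tokenizer, frn_tokenizer, eng_length, hin_length, hin_vocab_size,
    # encode_sequence and encode_output are assumed to be defined elsewhere.
    # `mode` was referenced but never defined, so it is a parameter here.
    eng = open(inputXPath, "r")
    french = open(inputYPath, "r")
    while True:
        X = []
        y = []
        while len(X) < batch_size:
            lineX = eng.readline()
            lineY = french.readline()
            if lineX == "":
                # End of file: rewind the file objects (seek() exists on files, not on strings)
                eng.seek(0)
                french.seek(0)
                if mode == "eval":
                    break  # in eval mode, end the batch at the end of the file
                lineX = eng.readline()
                lineY = french.readline()
            X.append(lineX.strip().lower())
            y.append(lineY.strip().lower())
        if not X:
            continue  # nothing was collected before the rewind, start a new pass
        # Use one consistent name for the encoded batch (was testX/testY vs. trainX/trainY)
        trainX = encode_sequence(eng_tokenizer, eng_length, X)
        trainY = encode_sequence(frn_tokenizer, hin_length, y)
        trainY = encode_output(trainY, hin_vocab_size)
        print('Train X shape', trainX.shape)
        print('Train y shape', trainY.shape)
        yield (trainX, trainY)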
I get a memory error even with a batch size of 1.
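
To check whether the file reading itself is the problem, the batching part can be isolated from the encoding step. Below is a minimal sketch of that idea, not a full solution: `batch_line_pairs` and its `loop` parameter are hypothetical names introduced here, and it assumes both files are UTF-8 text with one sentence per line, aligned line by line. It only ever holds one batch of raw strings in memory, so if the memory error persists, the cause is more likely in the encoding functions or the model than in the generator loop.

```python
from itertools import islice


def batch_line_pairs(x_path, y_path, batch_size, loop=True):
    """Yield (X, y) lists of at most batch_size stripped, lowercased lines.

    Hypothetical helper: reads both files in lockstep, one batch at a time.
    With loop=True it restarts from the top after each full pass (training);
    with loop=False it stops after one pass (evaluation).
    """
    while True:
        produced = False
        with open(x_path, encoding="utf-8") as fx, open(y_path, encoding="utf-8") as fy:
            pairs = zip(fx, fy)  # stops at the end of the shorter file
            while True:
                batch = list(islice(pairs, batch_size))
                if not batch:
                    break
                produced = True
                X = [src.strip().lower() for src, _ in batch]
                y = [tgt.strip().lower() for _, tgt in batch]
                yield X, y
        # Stop after one pass if asked to, or if the files were empty
        if not loop or not produced:
            return
```

Each yielded `X` and `y` could then be passed through `encode_sequence` and `encode_output` exactly as in the generator above; the last batch of a pass may be shorter than `batch_size`.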