I wrote a data generator for a large dataset of .mat files to use with Keras.
Here is my code. I am working on a 3-class problem; the data for each class lives in a separate folder (one, two, three), and each batch is filled with files drawn at random from these folders.
def generate_arrays_from_file(path, nc1, nc2, nc3):
    while True:
        for line in range(batch_size):
            Data, y = fetch_data(path, nc1, nc2, nc3)
            yield (Data, y)
import h5py
import numpy
import random

def fetch_data(path, nc1, nc2, nc3):
    trainData = numpy.empty(shape=[batch_size, img_rows, img_cols])
    y = []
    for line in range(batch_size):
        labelClass = random.randint(0, 2)
        if labelClass == 0:
            random_num = random.randint(1, nc1)
            file_name = path + '/' + 'one/one-' + str(random_num) + '.mat'
        elif labelClass == 1:
            random_num = random.randint(1, nc2)
            file_name = path + '/' + 'two/two-' + str(random_num) + '.mat'
        else:
            random_num = random.randint(1, nc3)
            file_name = path + '/' + 'three/three-' + str(random_num) + '.mat'
        matfile = h5py.File(file_name, 'r')
        x = matfile['data']
        x = numpy.transpose(x.value, axes=(1, 0))
        trainData[line, :, :] = x
        y.append(labelClass)
    trainData = trainData.reshape(trainData.shape[0], img_rows, img_cols, 1)
    return trainData, y
This code runs, but although batch_size is set to 16, the Keras output looks like this:
1/50000 [..............................] - ETA: 65067s - loss: 1.1666 - acc: 0.2500
2/50000 [..............................] - ETA: 34057s - loss: 1.4812 - acc: 0.2188
3/50000 [..............................] - ETA: 24202s - loss: 1.6554 - acc: 0.1875
4/50000 [..............................] - ETA: 18799s - loss: 1.5569 - acc: 0.2344
5/50000 [..............................] - ETA: 15611s - loss: 1.4662 - acc: 0.2625
6/50000 [..............................] - ETA: 13863s - loss: 1.4563 - acc: 0.2500
8/50000 [..............................] - ETA: 10978s - loss: 1.3903 - acc: 0.2734
9/50000 [..............................] - ETA: 10402s - loss: 1.3595 - acc: 0.2778
10/50000 [..............................] - ETA: 10253s - loss: 1.3333 - acc: 0.2875
11/50000 [..............................] - ETA: 10389s - loss: 1.3195 - acc: 0.2784
12/50000 [..............................] - ETA: 10411s - loss: 1.3063 - acc: 0.2760
13/50000 [..............................] - ETA: 10360s - loss: 1.2896 - acc: 0.2788
14/50000 [..............................] - ETA: 10424s - loss: 1.2772 - acc: 0.2768
15/50000 [..............................] - ETA: 10464s - loss: 1.2660 - acc: 0.2750
16/50000 [..............................] - ETA: 10483s - loss: 1.2545 - acc: 0.2852
17/50000 [..............................] - ETA: 10557s - loss: 1.2446 - acc: 0.3015
It seems batch_size is not being taken into account. Can you tell me why? Thanks.
Answer (score: 2)
Each step drawn from the generator (in the fit_generator call, code not shown in the question) is one batch. So:

- The batch size is defined by the generator, but it does not appear in the progress printout.
- The steps_per_epoch argument is the number of batches drawn from the generator; each step (i.e. each batch) is one tick in that output.
- The epochs argument defines how many times the whole thing is repeated.

You clearly chose steps_per_epoch = 50000, so Keras assumes you want to train on 50000 batches per epoch and will retrieve 50000 batches from the generator (but the size of each batch is still defined by the generator).
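To see the arithmetic, here is a minimal sketch (plain NumPy, no Keras needed; the dummy generator and its sizes are stand-ins for your generate_arrays_from_file) showing that the number of samples seen per epoch is steps_per_epoch times the batch size the generator produces:

```python
import numpy as np

def dummy_generator(batch_size, img_rows, img_cols):
    # Stands in for generate_arrays_from_file: each yield is ONE batch.
    while True:
        data = np.zeros((batch_size, img_rows, img_cols, 1))
        labels = np.zeros(batch_size)
        yield data, labels

gen = dummy_generator(batch_size=16, img_rows=8, img_cols=8)
steps_per_epoch = 5  # Keras draws this many batches per epoch

samples_seen = 0
for _ in range(steps_per_epoch):
    x, y = next(gen)          # one step == one batch
    samples_seen += x.shape[0]  # first dimension is the batch size

print(samples_seen)  # 5 steps * 16 samples per batch = 80
```

With steps_per_epoch = 50000 and a generator that yields batches of 16, an epoch covers 50000 * 16 samples; the progress bar counts steps, never individual samples.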
Checking the batch size

There are two ways to check it:
From the generator:
generator = generate_arrays_from_file(path, nc1, nc2, nc3)
generatorSampleX, generatorSampleY = next(generator)  # generator.next() in Python 2
print(generatorSampleX.shape)  # the first dimension is the batch size

# This advances the generator to its second element, so it is best to
# create the generator again before passing it to training.
From a callback:
from keras.callbacks import LambdaCallback

# The batch logs should include the batch size under the 'size' key.
callback = LambdaCallback(on_batch_end=lambda batch, logs: print(logs))
model.fit_generator(........, callbacks=[callback])