Data augmentation in Keras for a large dataset

Asked: 2018-07-25 19:20:35

Tags: keras classification large-data

I am training an image classification model with Keras on about 50k images. Each image has three channels and is 150x150 pixels. Because the intensity differences between the three channels are small, I have to store the images as floating-point values. I train on a GPU, but my graphics card does not have enough memory and I have no money to upgrade it. I also have to augment my dataset, because my training images do not cover all the rotations and translations that can appear in the test set. I wrote my own generator that splits the input images and labels into chunks, which are then fed to Keras's data augmentation routines and model.fit(). Here is my code:

from __future__ import print_function
from keras.preprocessing.image import ImageDataGenerator
from keras.utils import np_utils
from keras.callbacks import Callback
from keras.callbacks import ModelCheckpoint
from keras.callbacks import ReduceLROnPlateau
from keras.callbacks import CSVLogger
from keras.callbacks import EarlyStopping, TensorBoard, LearningRateScheduler
from keras.optimizers import SGD, Adam, RMSprop
from keras import backend as K
import tensorflow as tf
from sklearn.model_selection import train_test_split

import numpy as np
import math
import myCNN # my own convolutional neural network

def myBatchGenerator(X_train_large, y_train_large, chunk_size):
    number_of_images = len(y_train_large)

    while True:
        batch_start = 0
        batch_end = chunk_size

        while batch_start < number_of_images:
            limit = min(batch_end, number_of_images)
            X = X_train_large[batch_start:limit,:,:,:]
            y = y_train_large[batch_start:limit,:]
            yield(X,y)

            batch_start += chunk_size
            batch_end += chunk_size

if __name__ == '__main__':
    input_image_shape = (150,150,3)
    # read input images and labels
    # X_train_large is an array of type float16          
    # y_train_large is an array of size number of images x number of classes 
    X_train_large, y_train_large = myFunctionToReadTrainingImagesAndLabels()

    # validation images: about 5000 images 
    X_validation_large, y_validation_large = myFunctionToReadValidationImagesAndLabels()
    # create a stratified sample from the large training set. use 100 samples from each class
    y_train_large_vectors = [np.where(r == 1)[0][0] for r in y_train_large]
    unique, counts = np.unique(y_train_large_vectors, return_counts=True)

    X_train_sample = np.empty((12000, 150, 150, 3))
    y_train_sample = np.empty((12000, 12))

    for idx in range(num_classes):
        start_idx_for_sample = 100*idx
        end_idx_for_sample = start_idx_for_sample+99
        start_idx_for_large = np.max(counts)*idx
        end_idx_for_large = start_idx_for_large+99

        X_train_sample[start_idx_for_sample:end_idx_for_sample,:,:,:] = X_train_large[start_idx_for_large:end_idx_for_large,:,:,:]
        y_train_sample[start_idx_for_sample:end_idx_for_sample,:] = y_train_large[start_idx_for_large:end_idx_for_large,:]

    # define augmentation needed for image data generator
    train_datagen = ImageDataGenerator(featurewise_center=False,  
                                       samplewise_center=False,  
                                       featurewise_std_normalization=False, 
                                       samplewise_std_normalization=False,  
                                       zca_whitening=False,  
                                       rotation_range=90,  
                                       width_shift_range=0.1,  
                                       height_shift_range=0.1, 
                                       horizontal_flip=True, 
                                       vertical_flip=True)  
                                       
    train_datagen.fit(X_train_sample)
    
    # load my model
    model = myCNN.build_model(input_image_shape)
    sgd = SGD(lr=0.05,decay=10e-4,momentum=0.9)
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
    
    for e in range(number_of_epochs):
        print('*********************epoch', e)
        # get 1000 images at a time from the input image set
        for X_train, y_train in myBatchGenerator(X_train_large, y_train_large, chunk_size=1000):
            # split it into batches of 32 images/labels and augment on the fly
            for X_batch, y_batch in train_datagen.flow(X_train_large, y_train_large, batch_size=32):
                # train
                model.fit(X_batch, y_batch, validation_data=(X_validation_large, y_validation_large))

    model.save('myCNN_trained_on_largedataset.h5')

In short:
1. I create a stratified sample of the input images to fit the image data generator.
2. I split the input images into chunks of 1000 images and feed each chunk to the model in batches of 32.

So I am training the model on 32 images at a time, augmenting them on the fly as I go, and validating on about 5000 images.

The model is still running, but each batch of 32 images currently takes about 30 seconds to process, which means a single epoch will take an enormous amount of time. I must be missing something here.

I have tested my CNN code on a smaller dataset and it works fine, so I know the problem is neither in the function that reads the input images nor in my CNN itself. I think it is in the way I split the data into chunks and batch it, but I cannot figure out what is wrong. Can you point me in the right direction?

Thanks in advance for your time.

1 answer:

Answer 0 (score: 0)

Why not use flow_from_directory() from the ImageDataGenerator class? It is built into Keras and handles problems like yours very easily!
Specifically, flow_from_directory pulls batches directly from a directory and performs data augmentation on the fly. There are also two examples I can suggest to you:

I think that should be enough to get your network running ;).
Hope this helps, and good luck!
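
For illustration, here is a minimal sketch of how such a flow_from_directory pipeline could be wired up. The data/train and data/validation directory layout (one subfolder per class), the number_of_epochs value, and the reuse of myCNN are assumptions standing in for the asker's own setup, not part of the original answer:

from keras.preprocessing.image import ImageDataGenerator
from keras.optimizers import SGD
import myCNN  # the asker's own network-building module

number_of_epochs = 50  # placeholder value

# Same augmentation settings as in the question; images are read from disk
# in batches, so the full 50k-image set never has to sit in memory at once.
train_datagen = ImageDataGenerator(rotation_range=90,
                                   width_shift_range=0.1,
                                   height_shift_range=0.1,
                                   horizontal_flip=True,
                                   vertical_flip=True)

# Assumed layout: one subfolder per class under data/train.
train_generator = train_datagen.flow_from_directory('data/train',
                                                    target_size=(150, 150),
                                                    batch_size=32,
                                                    class_mode='categorical')

# Validation images are only batched from disk, not augmented.
validation_generator = ImageDataGenerator().flow_from_directory('data/validation',
                                                                target_size=(150, 150),
                                                                batch_size=32,
                                                                class_mode='categorical')

model = myCNN.build_model((150, 150, 3))
sgd = SGD(lr=0.05, decay=10e-4, momentum=0.9)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

# fit_generator runs the epoch and batch loops itself and validates once per
# epoch, rather than after every 32-image batch.
model.fit_generator(train_generator,
                    steps_per_epoch=len(train_generator),
                    epochs=number_of_epochs,
                    validation_data=validation_generator,
                    validation_steps=len(validation_generator))

With this setup the generator yields 32 augmented images at a time straight from disk, so neither the hand-written chunking generator nor the per-batch model.fit() calls are needed.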