Keras: how to take the validation set from random samples?

Asked: 2018-09-21 08:16:53

Tags: python tensorflow keras

I am currently training a Keras model, and the corresponding fit call looks like this:

model.fit(X, y_train, batch_size=myBatchSize, epochs=myAmountOfEpochs, validation_split=0.1, callbacks=myCallbackList)

This comment on the Keras GitHub page explains what validation_split=0.1 means:

    The validation data is not necessarily taken from every class; it is
    just the last 10% of the data (assuming that you ask for 10%).

My question now is: is there an easy way to randomly select 10% of my training data as validation data? The reason I would like to use randomly picked samples is that, in my case, the last 10% of the data does not necessarily contain all classes.

Thank you very much in advance.

6 answers:

Answer 0 (score: 2)

Keras does not offer anything more advanced than taking a fraction of your training data for validation. If you need something more advanced, such as stratified sampling to make sure the classes are well represented in the sample, you have to do this manually outside of Keras (using, say, scikit-learn or numpy) and then pass that validation data to Keras through the validation_data parameter of model.fit.
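For a plain random (non-stratified) split, a minimal numpy sketch could look like this; it reuses X, y_train, model and the fit arguments from the question (assuming X and y_train are numpy arrays), and the 10% fraction is taken over from the question:

import numpy as np

# randomly permute the sample indices and hold out 10% for validation
indices = np.random.permutation(len(X))
n_val = int(0.1 * len(X))
val_idx, train_idx = indices[:n_val], indices[n_val:]

model.fit(X[train_idx], y_train[train_idx],
          batch_size=myBatchSize,
          epochs=myAmountOfEpochs,
          validation_data=(X[val_idx], y_train[val_idx]),
          callbacks=myCallbackList)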

Answer 1 (score: 1)

Inspired by Matias Valdenegro's comment, I looked into this a bit further and came up with the following solution to my problem:

from sklearn.model_selection import train_test_split

# input: X and Y
XTraining, XValidation, YTraining, YValidation = train_test_split(X, Y, stratify=Y, test_size=0.1)  # before model building
# [the model is built here...]
model.fit(XTraining, YTraining, batch_size=batchSize, epochs=amountOfEpochs, validation_data=(XValidation, YValidation), callbacks=callbackList)
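Note that stratify=Y makes train_test_split preserve the class proportions of Y in both the training and the validation split, which is exactly what a plain validation_split cannot guarantee.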

Answer 2 (score: 1)

In this post I suggested a solution that uses the split-folders package to randomly split your main data directory into training and validation directories while keeping the class subfolders. You can then use the Keras .flow_from_directory method to specify your training and validation paths.

From the split-folders documentation:

import split_folders

# Split with a ratio.
# To only split into a training and validation set, pass a tuple to `ratio`, i.e. `(.8, .2)`.
split_folders.ratio('input_folder', output="output", seed=1337, ratio=(.8, .1, .1)) # default values

# Split val/test with a fixed number of items e.g. 100 for each set.
# To only split into a training and validation set, pass a single number to `fixed`, i.e. `10`.
split_folders.fixed('input_folder', output="output", seed=1337, fixed=(100, 100), oversample=False) # default values
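If the package is not installed yet, it can presumably be installed from PyPI (at the time it was published as split-folders, imported as split_folders):

pip install split-folders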

Your input folder should have the following format:

input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...

in order to give you this:

output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/            # optional
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...

Using the Keras ImageDataGenerator to build the training and validation datasets:

import tensorflow as tf
import split_folders
import os

main_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/Data'
output_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/output'

split_folders.ratio(main_dir, output=output_dir, seed=1337, ratio=(.7, .3))

train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./224)

train_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'train'),
                                                    class_mode='categorical',
                                                    batch_size=32,
                                                    target_size=(224,224),
                                                    shuffle=True)

validation_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'val'),
                                                        target_size=(224, 224),
                                                        batch_size=32,
                                                        class_mode='categorical',
                                                        shuffle=True) # set as validation data

IMG_SHAPE = (224, 224, 3)  # assumed: matches the 224x224 RGB images produced by the generators above

base_model = tf.keras.applications.ResNet50V2(
    input_shape=IMG_SHAPE,
    include_top=False,
    weights=None)

maxpool_layer = tf.keras.layers.GlobalMaxPooling2D()
prediction_layer = tf.keras.layers.Dense(4, activation='softmax')

model = tf.keras.Sequential([
    base_model,
    maxpool_layer,
    prediction_layer
])

opt = tf.keras.optimizers.Adam(learning_rate=0.004)
model.compile(optimizer=opt,
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

model.fit(
    train_generator,
    steps_per_epoch = train_generator.samples // 32,
    validation_data = validation_generator,
    validation_steps = validation_generator.samples // 32,
    epochs = 20)
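Here steps_per_epoch and validation_steps are simply the number of complete 32-image batches each generator yields, matching the batch_size passed to flow_from_directory.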

Answer 3 (score: 0)

This is too long for a comment, so I am posting it here.

If you have 1000 training samples, 100 test samples, validation_split=0.1 and batch_size=100, what it does is: split on the training data (batch 1: 90 training and 10 validation samples, batch 2: 90 training and 10 validation samples, ..., all in the original order, 90, 10, 90, 10, ..., 90, 10), and this has nothing to do with the 100 test samples (which your model never sees). So I guess what you want is to shuffle only all the size-10 validation sets, without touching the size-90 training sets. What I would probably do is manually shuffle the 10% part of my data, because that is what shuffle=True does: it just shuffles the indices and replaces the old training data with data at the new shuffled indices, like this:

import numpy as np

# 1000 training indices, viewed as 10 batches of 100; within each batch
# the last 10 positions are the validation slice described above
train_index = np.arange(1000, dtype=np.int32)
split = 0.1
batch_size = 100
num_batch = int(len(train_index) / batch_size)
train_index = np.reshape(train_index, (num_batch, batch_size))
for i in range(num_batch):
    # shuffle only the 10 validation positions of batch i
    r = np.random.choice(range(10), 10, replace=False)
    print(r)
    train_index[i, int((1 - split) * batch_size):] = np.array(r + ((1 - split) * batch_size) + i * batch_size)
    print(train_index[i])

flatten_index = train_index.reshape(-1)
print(flatten_index)

# reorder the actual training data with the shuffled index
x_train = np.arange(1000, 2000)
x_train = x_train[flatten_index]
print(x_train)
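In the printed flatten_index you can see that, within each 100-element batch, the first 90 indices keep their original order and only the last 10 are permuted, so each validation slice is drawn randomly from its own batch.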

Answer 4 (score: -1)

According to the Keras Getting Started FAQ, you can use the shuffle parameter in model.fit.

Answer 5 (score: -1)

Among the model.fit() parameters, validation_data overrides validation_split, so there is no need to configure both at the same time. From the Keras documentation:

validation_split: Float between 0 and 1.
            Fraction of the training data to be used as validation data.
            The model will set apart this fraction of the training data,
            will not train on it, and will evaluate
            the loss and any model metrics
            on this data at the end of each epoch.

validation_data: Data on which to evaluate
            the loss and any model metrics at the end of each epoch.
            The model will not be trained on this data.
            `validation_data` will override `validation_split`

There is, however, an option that achieves your purpose: the shuffle parameter.

shuffle: Boolean (whether to shuffle the training data
            before each epoch) or str (for 'batch').
            'batch' is a special option for dealing with the
            limitations of HDF5 data; it shuffles in batch-sized chunks.

So what you can do is:

model.fit(**other_kwargs, validation_split=0.1, shuffle=True)