Question

I have two folders of image data, one for training and one for testing, but they contain different labels, like this:

training-
         |-a  -img1.png
               img2.png
         |-as -img1.png
               img2.png
         |-are-img1.png
testing -
         |-as -img1.png
         |-and-img1.png
               img1.png

How can I create ytrain and ytest from this dataset?

I tried the following code:

datagen = ImageDataGenerator(rescale=1. / 255)  
generator = datagen.flow_from_directory(train_data_dir,  
target_size=(img_width, img_height),  
batch_size=batch_size,  
class_mode=None,  
shuffle=False)  

nb_train_samples = len(generator.filenames)  
num_classes = len(generator.class_indices)

Found 316 images belonging to 68 classes.

generator = datagen.flow_from_directory(  
test_data_dir,  
target_size=(img_width, img_height),  
batch_size=batch_size,  
class_mode=None,  
shuffle=False)  
nb_test_samples = len(generator.filenames)

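Found 226 images belonging to 48 classes.
Is this the correct way to label the data? The two datasets contain different folder names: (a, as, are) and (as, and).

When I build the model, I get 0% accuracy: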

model = Sequential()  
model.add(Flatten(input_shape=train_data.shape[1:]))  
model.add(Dense(256, activation='relu'))  
model.add(Dropout(0.5))  
model.add(Dense(num_classes, activation='sigmoid'))  

model.compile(optimizer='rmsprop',  
          loss='categorical_crossentropy', metrics=['accuracy'])  

history = model.fit(train_data, train_labels,
                    epochs=epochs, batch_size=batch_size,
                    validation_data=(test_data, test_labels))

model.save_weights(top_model_weights_path)  

(eval_loss, eval_accuracy) = model.evaluate(  
 test_data, test_labels, batch_size=batch_size, verbose=1)

Answer 1

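I suggest merging the two datasets, shuffling them, and splitting them again so that the training and test sets contain the same labels. That is the proper way to label the data, because the model needs to "see" all possible labels and compare them against the test dataset.

To do this you can use gapcv:

Install the library: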

pip install gapcv

Mix the folders:

from gapcv.utils.img_tools import ImgUtils
gap = ImgUtils(root_path='root_folder{}/training'.format('_t2'))
gap.transf='2to1'
gap.transform()

This will create a folder with the following structure:

root_folder-
         |-a  -img1.png
               img2.png
         |-as -img1.png
               img2.png
         |-are-img1.png
         |-and-img1.png
               img1.png
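For readers who prefer not to add a dependency, the merge, shuffle, and re-split idea can be sketched with the standard library alone. This is only an illustration under stated assumptions: `merge_and_split` is a hypothetical helper (not part of gapcv or Keras), each class is a direct sub-folder of its source directory, and the output layout `train/` and `test/` is my own choice.

```python
import random
import shutil
from pathlib import Path


def merge_and_split(source_dirs, out_dir, test_fraction=0.2, seed=0):
    """Pool images per class from several folder trees, then re-split them.

    Hypothetical helper. Every class folder is created in both out_dir/train
    and out_dir/test; a class with a single image contributes no test image.
    Returns {class_name: (n_train, n_test)}.
    """
    rng = random.Random(seed)

    # Collect every image path under its class (sub-folder) name.
    by_class = {}
    for src in source_dirs:
        for class_dir in sorted(Path(src).iterdir()):
            if class_dir.is_dir():
                files = sorted(p for p in class_dir.iterdir() if p.is_file())
                by_class.setdefault(class_dir.name, []).extend(files)

    counts = {}
    for label, files in sorted(by_class.items()):
        rng.shuffle(files)
        # Always keep at least one image for training.
        n_test = min(len(files) - 1, max(1, round(len(files) * test_fraction)))
        splits = {"test": files[:n_test], "train": files[n_test:]}
        for split, members in splits.items():
            target = Path(out_dir) / split / label
            # Created even when empty, so both trees list the same classes.
            target.mkdir(parents=True, exist_ok=True)
            for i, path in enumerate(members):
                # Counter prefix: both sources reuse names such as img1.png.
                shutil.copy2(path, target / f"{i:05d}_{path.name}")
        counts[label] = (len(files) - n_test, n_test)
    return counts
```

Calling `merge_and_split(['training', 'testing'], 'merged')` on the layout from the question would give `merged/train` and `merged/test` the same four class folders (a, and, are, as), which is the property the model needs.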

Option 1

Use gapcv to preprocess the dataset into a shareable h5 file and fit it into a Keras model using Images:

import os
from gapcv.vision import Images
if not os.path.isfile('name_data_set.h5'):
    # this will create the `h5` file if it doesn't exist
    images = Images('name_data_set', 'root_folder', config=['resize=(224,224)', 'store'])

# this will stream the data from the `h5` file so you don't overload your memory
# augment only if needed; otherwise use just Images(config=['stream'])
# pixels are normalized by 1.0/255.0 by default
images = Images(config=['stream'], augment=['flip=both', 'edge', 'zoom=0.3', 'denoise'])
images.load('name_data_set')

#Metadata

print('images train')
print('Time to load data set:', images.elapsed)
print('Number of images in data set:', images.count)
print('classes:', images.classes)

Generator:

images.split = 0.2
images.minibatch = 32
gap_generator = images.minibatch
X_test, Y_test = images.test

Fit the Keras model:

model.fit_generator(generator=gap_generator,
                    validation_data=(X_test, Y_test),
                    epochs=epochs,
                    steps_per_epoch=steps_per_epoch)
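The call above uses `steps_per_epoch` without defining it. A common choice is the number of training images divided by the batch size, rounded up. The sketch below assumes that convention; `steps_per_epoch_for` is a hypothetical helper, and how gapcv itself rounds the held-out split is not stated here, so treat the count as approximate.

```python
import math


def steps_per_epoch_for(n_images, batch_size, test_fraction=0.0):
    """Batches needed to see every training image once per epoch."""
    # Assumption: the held-out share is rounded down to a whole image count.
    n_train = n_images - int(n_images * test_fraction)
    return max(1, math.ceil(n_train / batch_size))
```

For instance, with the 316 + 226 = 542 images from the question, a 0.2 split, and a batch size of 32, `steps_per_epoch_for(542, 32, 0.2)` gives 14.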

Why use gapcv? Well, it is about twice as fast at fitting the model as ImageDataGenerator() :)

Option 2

Use gapcv to shuffle and split the dataset with the same labels:

gap = ImgUtils(root_path='root_folder')

# Tree 2
gap.transform(shufle=True, img_split=0.2)

Then use the Keras ImageDataGenerator() as usual.
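When going back to ImageDataGenerator(), it may help to pin the class list explicitly so that the training and test generators assign the same index to each label even if a folder is missing from one side. The sketch below only builds that list; `shared_classes` is a hypothetical helper, and it assumes each class is a direct sub-folder of the data directory.

```python
from pathlib import Path


def shared_classes(*data_dirs):
    """Return the sorted union of class (sub-folder) names across directories."""
    names = set()
    for data_dir in data_dirs:
        names.update(p.name for p in Path(data_dir).iterdir() if p.is_dir())
    return sorted(names)
```

The result can be passed as the `classes` argument of both `flow_from_directory` calls, for example `classes=shared_classes(train_data_dir, test_data_dir)`. Using `class_mode='categorical'` instead of `class_mode=None` then makes the generators yield one-hot labels, which is what ytrain and ytest need to be for `categorical_crossentropy`.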

Documentation:

Training notebook for mixing and splitting folders.
gapcv documentation.

Let me know how it goes. :)

Answer 2

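Gap is very flexible for these kinds of problems. My favorite way to combine separate training and test datasets is Gap's dataset merge feature (the += operator), as follows: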

# load the images from the Training directory
images = Images('name_of_dataset', 'training', config=['resize=(224,224)', 'store'])

# load the images from the Testing directory and merge them with the Training data
images += Images('name_of_dataset', 'testing', config=['resize=(224,224)', 'store'])

Building a CNN model with Keras for uneven training and testing image folder data
