How do I convert a set of text labels into vectors to feed a CNN in Keras?

Time: 2018-02-10 21:30:33

Tags: python keras

I have a dataset of images organized into folders:

/animals
    /dogs
    /cats
    /snakes
    /pandas

and so on, for 10 different classes in total.

I have an array called trainingImages[] that holds all of my preprocessed data (grayscale, 32x32).

I have an array called trainingLabels[] that holds all the labels, aligned with trainingImages[]. So trainingImages[1] is a preprocessed dog and trainingLabels[1] is the string 'dog'.

I then used sklearn's train_test_split() like this:

(trainX, testX, trainY, testY) = train_test_split(trainingImages, trainingLabels, test_size=0.2, random_state=1) 

At this point the shapes of trainX and trainY are (1095, 32, 32, 1) and (1095, 20, 2), respectively.

I know I now have to convert trainY into one-hot vectors. I have tried LabelBinarizer and to_categorical, but I still have shape problems:

lb = LabelBinarizer().fit(trainY)
testY = lb.transform(testY)
trainY = lb.transform(trainY)

testY = keras.utils.to_categorical(testY)
trainY = keras.utils.to_categorical(trainY)

But when I feed this into the Sequential model, I get the following error, which tells me the shape is wrong on input:

ValueError: Error when checking target: expected conv2d_1 to have 4 dimensions, but got array with shape (1095, 20, 2)

How do I prepare this data correctly?
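For reference, a trailing dimension of 2 in the label array is consistent with the labels having been one-hot encoded twice, for example by running to_categorical on an array that LabelBinarizer had already turned into 0/1 rows. This is an assumption; the question does not confirm it. The sketch below encodes string labels exactly once, using only NumPy so that it runs without Keras or scikit-learn. The label values are made up for illustration.

```python
import numpy as np

# Hypothetical string labels standing in for trainingLabels.
labels = np.array(["dog", "cat", "snake", "dog", "panda"])

# Map each distinct string to an integer index (classes come back sorted).
classes, label_ids = np.unique(labels, return_inverse=True)

# One-hot encode exactly once: row i has a single 1 in column label_ids[i].
one_hot = np.eye(len(classes), dtype="float32")[label_ids]

print(classes)
print(one_hot.shape)  # (5, 4): (num_samples, num_classes)
```

An array of shape (num_samples, num_classes) is what a final Dense(num_classes) layer with softmax and categorical_crossentropy expects as its target.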

Edit:

Code:

inputWidth = 32
inputHeight = 32
inputDepth = 1

batchSize = 32
inputShape = (inputHeight, inputWidth, inputDepth)

trainDataGenerator = ImageDataGenerator(rescale=1. /255, shear_range=0.2,
    zoom_range=0.2, horizontal_flip=True)
testDataGenerator = ImageDataGenerator(rescale=1. /255)

trainingSet = trainDataGenerator.flow_from_directory(args.dataset,
    target_size=(32, 32), batch_size=batchSize, class_mode='categorical',
    color_mode='grayscale')
testingSet = testDataGenerator.flow_from_directory(args.dataset,
    target_size=(32, 32), batch_size=batchSize, class_mode='categorical',
    color_mode='grayscale')

model = Sequential()
model.add(Conv2D(20, (5, 5), padding="same", input_shape=inputShape))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

model.add(Conv2D(50, (5, 5), padding="same"))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

model.add(Flatten())
model.add(Dense(500))
model.add(Activation("relu"))

model.add(Dense(trainingClasses))
model.add(Activation("softmax"))

opt = SGD(lr=0.01)
model.compile(loss="categorical_crossentropy", optimizer=opt,
    metrics=["accuracy"])

model.fit_generator(trainingSet, steps_per_epoch=80, epochs=20,
    validation_data=testingSet)

1 Answer:

Answer 0 (score: 1)

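You can use the Keras ImageDataGenerator. The generator takes all the folders in the root directory and creates a category for each one. Every file inside a category folder is automatically assigned to the category of its parent folder. The split between the test and training sets can then be done by hand in a couple of minutes.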

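batch_size = 32
# input_size (e.g. (32, 32)) and classifier (a compiled model) are assumed to be defined elsewhere
# Augment the data by shearing, zooming and flipping for more diverse training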

train_datagen = ImageDataGenerator(rescale=1. / 255,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1. / 255)

training_set = train_datagen.flow_from_directory('folder/of/training/root/directory',
                                                 target_size=input_size,
                                                 batch_size=batch_size,
                                                 class_mode='categorical')

test_set = test_datagen.flow_from_directory('folder/of/test/root/directory',
                                            target_size=input_size,
                                            batch_size=batch_size,
                                            class_mode='categorical')


# Train the CNN on categories defined by the folder structure
classifier.fit_generator(training_set,
                         steps_per_epoch=8000 // batch_size,
                         epochs=90,
                         validation_data=test_set,
                         validation_steps=2000 // batch_size,
                         workers=12)
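The manual train/test split mentioned above could also be scripted. The following is a minimal sketch using only the standard library; the function name split_dataset, its parameters, and the train/ and test/ output folder names are hypothetical choices, not part of Keras.

```python
import random
import shutil
from pathlib import Path


def split_dataset(src_root, dst_root, test_fraction=0.2, seed=1):
    """Copy each class folder under src_root into dst_root/train and dst_root/test."""
    rng = random.Random(seed)
    src_root, dst_root = Path(src_root), Path(dst_root)
    for class_dir in sorted(p for p in src_root.iterdir() if p.is_dir()):
        files = sorted(p for p in class_dir.iterdir() if p.is_file())
        rng.shuffle(files)  # shuffle so the test files are a random sample
        n_test = int(len(files) * test_fraction)
        for i, path in enumerate(files):
            subset = "test" if i < n_test else "train"
            target_dir = dst_root / subset / class_dir.name
            target_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target_dir / path.name)
```

The resulting train and test directories can then be passed to the two flow_from_directory calls. Note that the code in the question points both generators at the same args.dataset folder, so its validation data would be the same images it trains on.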

You can see which index each category is assigned (that is, its position in the one-hot vector) by printing the following:

print("The model class indices are:", training_set.class_indices)
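Since class_indices maps folder names to integer indices, it can be inverted to turn model outputs back into names. The sketch below assumes a hypothetical mapping and made-up softmax outputs in place of a real generator and model.

```python
import numpy as np

# Hypothetical mapping, in the form flow_from_directory exposes as class_indices.
class_indices = {"cats": 0, "dogs": 1, "pandas": 2, "snakes": 3}

# Invert it so an index can be turned back into a folder name.
index_to_class = {index: name for name, index in class_indices.items()}

# Made-up softmax outputs for two images (each row sums to 1).
predictions = np.array([[0.10, 0.70, 0.10, 0.10],
                        [0.05, 0.05, 0.10, 0.80]])

# Pick the most probable column per row and look up its name.
predicted_names = [index_to_class[int(i)] for i in predictions.argmax(axis=1)]
print(predicted_names)  # ['dogs', 'snakes']
```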