Question

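I am currently trying to classify texts with a neural network, using my own input data.

My dataset is very limited: about 85 positively classified and 85 negatively classified texts, each file roughly 1,500 words long. I was therefore told to use cross-validation on my neural network to work around overfitting.

I started building the neural network with the help of YouTube videos and guides, and my problem now is how to do the cross-validation.

My current code looks like this: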

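# Imports inferred from the names used below (Keras 2-style API; an assumption)
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from keras import utils
from keras.preprocessing import text
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

data = pd.read_excel('dataset3.xlsx')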
max_words = 1000
tokenize = text.Tokenizer(num_words=max_words, lower=True, char_level=False)

train_size = int(len(data) * .8)
train_posts = data['Content'][:train_size]
train_tags = data['Value'][:train_size]
test_posts = data['Content'][train_size:]
test_tags = data['Value'][train_size:]

tokenize.fit_on_texts(train_posts)
x_train = tokenize.texts_to_matrix(train_posts)
x_test = tokenize.texts_to_matrix(test_posts)

encoder = LabelEncoder()
encoder.fit(train_tags)
y_train = encoder.transform(train_tags)
y_test = encoder.transform(test_tags)
num_classes = np.max(y_train) + 1
y_train = utils.to_categorical(y_train, num_classes)
y_test = utils.to_categorical(y_test, num_classes)

batch_size = 1
epochs = 20
model = Sequential()
model.add(Dense(750, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes))
model.add(Activation('sigmoid'))

model.summary()

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs= epochs,
                    verbose=1,
                    validation_split=0.2)

I have played around with

KFold(n_splits=k, shuffle=True, random_state=1).split(x_train, y_train)

but I have no idea how to use it with the neural network itself. I hope you can help me with my problem.

Thanks,

Jason

Answer 1

You can use scikit-learn's KFold.split like this:

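import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# x, y: the full vectorized texts and one-hot labels as NumPy arrays
accuracy = []
kf = KFold(n_splits=k, shuffle=True, random_state=1)
for trainIndices, testIndices in kf.split(x, y):
    # Build a fresh model for every fold
    model = Sequential()
    ...
    history = model.fit(x[trainIndices], y[trainIndices],
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_split=0.2)

    # predict returns class probabilities; argmax turns them into class labels
    prediction = np.argmax(model.predict(x[testIndices]), axis=1)
    # Score against the held-out labels of this fold, not the training labels
    accuracy.append(accuracy_score(np.argmax(y[testIndices], axis=1), prediction))

# Print the mean accuracy
print(sum(accuracy)/len(accuracy))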

How it works

If you need more information about the technique itself, have a look here.

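Regarding the scikit-learn implementation, kf.split yields k pairs of train indices and test indices. Yielding values means you can iterate over this function as you would over a list. Also keep in mind that it gives you indices, so to train the model you have to fetch the values like this: x[trainIndices]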
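To make the index pairs concrete, here is a minimal sketch on toy data. The arrays `x` and `y` are made-up stand-ins for the vectorized texts and labels, and the fold count of 5 is chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy stand-ins for the vectorized texts and labels (illustrative only).
x = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.array([0, 1] * 5)           # 10 labels

kf = KFold(n_splits=5, shuffle=True, random_state=1)
fold_sizes = []
for train_indices, test_indices in kf.split(x):
    # split() only returns positions; index the arrays to get the samples.
    x_fold_train, y_fold_train = x[train_indices], y[train_indices]
    x_fold_test, y_fold_test = x[test_indices], y[test_indices]
    fold_sizes.append((len(x_fold_train), len(x_fold_test)))

print(fold_sizes)  # five (8, 2) pairs: 8 training and 2 test samples per fold
```

Every sample lands in the test part of exactly one fold, so the five test partitions together cover the whole dataset.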

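For each of the k models, you compute the accuracy of that trained model on its test partition^[*]. Afterwards, you can compute the mean accuracy to determine the overall accuracy of the model.

^{[*] For this I used accuracy_score from scikit-learn, but it can also be done manually.}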
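As a rough illustration of doing it manually, the sketch below computes one fold's accuracy from one-hot labels and predicted probabilities, then averages over folds. The helper name `manual_accuracy` and all numbers are hypothetical and not taken from the question's data.

```python
import numpy as np

def manual_accuracy(y_true_onehot, y_prob):
    """Fraction of samples whose most probable class matches the one-hot label."""
    true_classes = np.argmax(y_true_onehot, axis=1)
    predicted_classes = np.argmax(y_prob, axis=1)
    return float(np.mean(true_classes == predicted_classes))

# Made-up labels and model outputs for a single fold (illustrative only).
y_true = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
y_prob = np.array([[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.7, 0.1]])
fold_accuracy = manual_accuracy(y_true, y_prob)  # 3 of 4 correct -> 0.75

# Hypothetical accuracies of two further folds, averaged as in the answer.
fold_accuracies = [fold_accuracy, 0.5, 1.0]
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(fold_accuracy, mean_accuracy)
```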

Answer 2

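Do you want to use K-fold for validation or for testing?

K-fold is very simple (you could do it yourself with random). It splits the input list into k subsets and returns 2 arrays: the first (larger) one holds the indices of the items in (k-1) subsets, and the second (smaller) one holds the indices of the k-th subset. You then decide how to use them. So what K-fold does is help you pick the best combination of training and test (or validation) data.

The code would look like this: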
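For a sense of what "doing it yourself" could look like, here is a minimal sketch using only the standard `random` module. The function name `k_fold_indices` and its parameters are hypothetical, and this is not how scikit-learn implements KFold internally.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, held_out_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # A stride of k gives k disjoint subsets whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        # Training indices are all the other k-1 subsets joined together.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out

splits = list(k_fold_indices(n_samples=10, k=3, seed=1))
for train, held_out in splits:
    print(len(train), len(held_out))
```

With 10 samples and k=3 the held-out subsets have sizes 4, 3 and 3, and each index is held out exactly once.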

kfold = KFold(n_splits=k, shuffle=True, random_state=n) # Choose your own k and n
for arr1, arr2 in kfold.split(X, y):
  X_train, y_train = X[arr1], y[arr1]  # the (k-1) training subsets
  X_k, y_k = X[arr2], y[arr2]  # 'X_k' should be 'X_test' or 'X_valid' depending on your purpose
  # train your model

So, if you want to do K-fold validation, the next code would be

  model = YourModel()
  model.fit(X_train, y_train, validation_data=(X_k, y_k), epochs=...)

or, for K-fold testing:

  model = YourModel()
  model.fit(X_train, y_train, epochs=...) # train without validation
  # model.fit(X_train, y_train, validation_split= ... ) # train with validation
  model.evaluate(X_k, y_k)  # evaluate needs the held-out labels as well

Notice that with K-fold you train the model k times.
