Train/test split not splitting correctly

Time: 2021-01-08 18:24:16

Tags: python numpy tensorflow deep-learning

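I'm still a beginner in AI and deep learning, but I wanted to test whether a neural network can compute the sum of two numbers. I generated a dataset of 5000 samples and set test size = 0.3, so the training set should contain 3500 samples. Strangely, the model appears to train on only 110 inputs instead of 3500.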

Code used:

import tensorflow as tf
from sklearn.model_selection import train_test_split
import numpy as np
from random import random


def generate_dataset(num_samples, test_size=0.33):
    """Generates train/test data for sum operation
    :param num_samples (int): Num of total samples in dataset
    :param test_size (float): Ratio of num_samples used as test set
    :return x_train (ndarray): 2d array with input data for training
    :return x_test (ndarray): 2d array with input data for testing
    :return y_train (ndarray): 2d array with target data for training
    :return y_test (ndarray): 2d array with target data for testing
    """

    # build inputs/targets for sum operation: y[0][0] = x[0][0] + x[0][1]
    x = np.array([[random()/2 for _ in range(2)] for _ in range(num_samples)])
    y = np.array([[i[0] + i[1]] for i in x])

    # split dataset into test and training sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_size)
    return x_train, x_test, y_train, y_test


if __name__ == "__main__":

    # create a dataset with 5000 samples
    x_train, x_test, y_train, y_test = generate_dataset(5000, 0.3)

    # build model with 3 layers: 2 -> 5 -> 1
    model = tf.keras.models.Sequential([
      tf.keras.layers.Dense(5, input_dim=2, activation="sigmoid"),
      tf.keras.layers.Dense(1, activation="sigmoid")
    ])

    # choose optimiser
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

    # compile model
    model.compile(optimizer=optimizer, loss='mse')

    # train model
    model.fit(x_train, y_train, epochs=100)

    # evaluate model on test set
    print("\nEvaluation on the test set:")
    model.evaluate(x_test,  y_test, verbose=2)

    # get predictions
    data = np.array([[0.1, 0.2], [0.2, 0.2]])
    predictions = model.predict(data)

    # print predictions
    print("\nPredictions:")
    for d, p in zip(data, predictions):
        print("{} + {} = {}".format(d[0], d[1], p[0]))

[Image: Keras training output showing 110/110 for each epoch]

1 answer:

Answer 0 (score: 2)

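The 110/110 you see in the image is actually the batch count, not the sample count. 110 batches * the default batch size of 32 gives roughly 3500 training samples (110 * 32 = 3520), which matches the 70% of 5000 you expected.

You can also work backwards to see that the last batch will be a partial batch, because 3500 is not evenly divisible by 32: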

>>> (.7 * 5000) / 110
31.818181818181817
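The same check can be run in the forward direction. The following is a minimal sketch, assuming the 3500 training samples and the default batch size of 32 mentioned above; `steps_per_epoch` is a hypothetical helper written for illustration, not a Keras function.

```python
import math


def steps_per_epoch(num_samples, batch_size=32):
    """Hypothetical helper: batches needed to cover num_samples once.

    The count is rounded up because a leftover partial batch is still one step.
    """
    return math.ceil(num_samples / batch_size)


num_train = 3500  # 70% of the 5000 generated samples
batch_size = 32   # assumed: the default batch size when none is passed to model.fit

steps = steps_per_epoch(num_train, batch_size)
# Size of the final, partial batch: whatever is left after the full batches.
last_batch = num_train - (steps - 1) * batch_size

print(steps, last_batch)
```

With these numbers there are 109 full batches plus one final batch of 12 samples, which is why the progress counter stops at 110.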

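In neural networks, an epoch is one complete pass over the data. Training proceeds in mini-batches (also called steps), and that is what Keras reports in its progress output.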
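To make the epoch/step relationship concrete, here is a rough sketch of how one pass over the data breaks into mini-batches. It is an illustration only, not how Keras is implemented: `iter_minibatches` is a hypothetical name, a plain list stands in for the training array, and the shuffling that Keras applies by default is left out.

```python
def iter_minibatches(samples, batch_size=32):
    """Hypothetical helper: yield consecutive slices of `samples`.

    The final slice is shorter when len(samples) is not a multiple of batch_size.
    """
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


samples = list(range(3500))  # stand-in for the 3500 training rows

# One epoch: every sample is visited exactly once, one mini-batch per step.
batches = list(iter_minibatches(samples))

print(len(batches))                  # steps in one epoch
print(sum(len(b) for b in batches))  # total samples seen in that epoch
print(len(batches[-1]))              # size of the last, partial batch
```

Passing a different `batch_size` to `model.fit` changes the number of steps per epoch in the same way, while the number of samples seen per epoch stays at 3500.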