Deep learning regression: huge MSE and loss

Time: 2019-07-30 07:50:35

Tags: tensorflow machine-learning keras deep-learning regression

I am trying to train a model to predict car prices. The dataset is from Kaggle: https://www.kaggle.com/vfsousas/autos#autos.csv

I prepare the data with the following code:

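import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# DataSet is the asker's own base class; its definition is not shown in the post.
class CarDataset(DataSet):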

    def __init__(self, csv_file):
        df = pd.read_csv(csv_file).drop(["dateCrawled", "name", "abtest", "dateCreated", "nrOfPictures", "postalCode", "lastSeen"], axis = 1)

        df = df.drop(df[df["seller"] == "gewerblich"].index).drop(["seller"], axis = 1)
        df = df.drop(df[df["offerType"] == "Gesuch"].index).drop(["offerType"], axis = 1)

        df = df[df["vehicleType"].notnull()]
        df = df[df["notRepairedDamage"].notnull()]
        df = df[df["model"].notnull()]
        df = df[df["fuelType"].notnull()]

        df = df[(df["price"] > 100) & (df["price"] < 100000)]
        df = df[(df["monthOfRegistration"] > 0) & (df["monthOfRegistration"] < 13)]
        df = df[(df["yearOfRegistration"] < 2019) & (df["yearOfRegistration"] > 1950)]
        df = df[(df["powerPS"] > 20) & (df["powerPS"] < 550)]

        df["hasDamage"] = np.where(df["notRepairedDamage"] == "ja", 1, 0)
        df["automatic"] = np.where(df["gearbox"] == "manuell", 1, 0)
        df["fuel"] = np.where(df["fuelType"] == "benzin", 0, 1)
        df["age"] = (2019 - df["yearOfRegistration"]) * 12 + df["monthOfRegistration"]

        df = df.drop(["notRepairedDamage", "gearbox", "fuelType", "yearOfRegistration", "monthOfRegistration"], axis = 1)

        df = pd.get_dummies(df, columns = ["vehicleType", "model", "brand"])

        self.df = df
        self.Y = self.df["price"].values
        self.X = self.df.drop(["price"], axis = 1).values

        scaler = StandardScaler()
        scaler.fit(self.X)

        self.X = scaler.transform(self.X)

        self.x_train, self.x_test, self.y_train, self.y_test = train_test_split(self.X, 
                                                                                    self.Y, 
                                                                                    test_size = 0.25,
                                                                                    random_state = 0)

        self.x_train, self.x_valid, self.y_train, self.y_valid = train_test_split(self.x_train, 
                                                                                    self.y_train, 
                                                                                    test_size = 0.25,
                                                                                    random_state = 0)   

    def get_input_shape(self):
        return (len(self.df.columns)-1, )        # (303, )

This produces the following prepared dataset:

    price  powerPS  kilometer  hasDamage  automatic  fuel  age  vehicleType_andere  vehicleType_bus  vehicleType_cabrio  vehicleType_coupe  ...  brand_rover  brand_saab  brand_seat  brand_skoda  brand_smart  brand_subaru  brand_suzuki  brand_toyota  brand_trabant  brand_volkswagen  brand_volvo
3    1500       75     150000          0          1     0  222                   0                0                   0                  0  ...            0           0           0            0            0             0             0             0              0                 1            0
4    3600       69      90000          0          1     1  139                   0                0                   0                  0  ...            0           0           0            1            0             0             0             0              0                 0            0
5     650      102     150000          1          1     0  298                   0                0                   0                  0  ...            0           0           0            0            0             0             0             0              0                 0            0
6    2200      109     150000          0          1     0  188                   0                0                   1                  0  ...            0           0           0            0            0             0             0             0              0                 0            0
10   2000      105     150000          0          1     0  192                   0                0                   0                  0  ...            0           0           0            0            0             0             0             0              0                 0            0

[5 rows x 304 columns]

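hasDamage is a flag (0 or 1) indicating whether the car has unrepaired damage
automatic is a flag (0 or 1) derived from the gearbox column (note that the code above sets it to 1 when gearbox is "manuell", i.e. manual)
fuel is 0 for petrol ("benzin") and 1 for any other fuel type, as coded above
age is the age of the car in months

The brand, model and vehicleType columns are one-hot encoded using df = pd.get_dummies(df, columns = ["vehicleType", "model", "brand"]).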

In addition, I use a StandardScaler to transform the X values.
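A side note on this step: the code above fits the scaler on all rows before splitting, so statistics from the validation and test rows influence the training features. Below is a minimal, dependency-free sketch of fitting on the training rows only. The helper names fit_standardizer and apply_standardizer are hypothetical; with scikit-learn the equivalent would be to call StandardScaler().fit on x_train and then transform on each split. The toy values are powerPS and kilometer from the sample rows shown above.

```python
from statistics import fmean, pstdev

def fit_standardizer(rows):
    """Per-column mean and population std, computed from the given rows only."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Constant columns have std 0; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def apply_standardizer(rows, means, stds):
    """Scale rows with statistics that were fitted elsewhere (the training split)."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Toy rows of (powerPS, kilometer); the split itself is made up for illustration.
x_train = [[75.0, 150000.0], [69.0, 90000.0], [102.0, 150000.0]]
x_test = [[109.0, 150000.0]]

means, stds = fit_standardizer(x_train)          # statistics from training rows only
x_train_scaled = apply_standardizer(x_train, means, stds)
x_test_scaled = apply_standardizer(x_test, means, stds)
print(means, stds)
print(x_test_scaled)
```

The test rows are transformed with the training statistics, so nothing about them is used when the scaling is fitted.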

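The dataset now contains 303 columns for X and, of course, the "price" column for Y.

With this dataset, a plain LinearRegression achieves a score of about 0.7 on both the training and the test set.

Now I have tried a deep learning approach with keras, but whatever I do, the MSE and the loss are huge and the model does not seem to learn anything: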

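from keras.callbacks import EarlyStopping, ReduceLROnPlateau
from keras.layers import Activation, Dense, Input
from keras.models import Model

# optimizer, learning_rate and self.verbose are defined elsewhere in the asker's code (not shown).
input_tensor = model_stack = Input(dataset.get_input_shape()) # (303, )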
model_stack = Dense(20)(model_stack)
model_stack = Activation("relu", name = "relu_1")(model_stack)

model_stack = Dense(20)(model_stack)
model_stack = Activation("relu", name = "relu_2")(model_stack)

model_stack = Dense(1, name = "Output")(model_stack)

model = Model(inputs = [input_tensor], outputs = [model_stack])
model.compile(loss = "mse", optimizer = optimizer(lr = learning_rate), metrics = ['mse'])

model.summary()

callbacks = []
callbacks.append(ReduceLROnPlateau(monitor = "val_loss", factor = 0.95, verbose = self.verbose, patience = 1))
callbacks.append(EarlyStopping(monitor='val_loss', patience = 5, min_delta = 0.01, restore_best_weights = True, verbose = self.verbose))


model.fit(x = dataset.x_train,
          y = dataset.y_train,
          verbose = 1,
          batch_size = 128,
          epochs = 200,
          validation_data = [dataset.x_valid, dataset.y_valid],
          callbacks = callbacks)

score = model.evaluate(dataset.x_test, dataset.y_test, verbose = 1)
print("Model score: {}".format(score))

The summary and training output look like this (learning rate 3e-4):

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 6)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 20)                140       
_________________________________________________________________
relu_1 (Activation)          (None, 20)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 20)                420       
_________________________________________________________________
relu_2 (Activation)          (None, 20)                0         
_________________________________________________________________
Output (Dense)               (None, 1)                 21        
=================================================================
Total params: 581
Trainable params: 581
Non-trainable params: 0
_________________________________________________________________
Train on 182557 samples, validate on 60853 samples
Epoch 1/200
182557/182557 [==============================] - 2s 13us/step - loss: 110046953.4602 - mean_squared_error: 110046953.4602 - acc: 0.0000e+00 - val_loss: 107416331.4062 - val_mean_squared_error: 107416331.4062 - val_acc: 0.0000e+00
Epoch 2/200
182557/182557 [==============================] - 2s 11us/step - loss: 97859920.3050 - mean_squared_error: 97859920.3050 - acc: 0.0000e+00 - val_loss: 85956634.8803 - val_mean_squared_error: 85956634.8803 - val_acc: 1.6433e-05
Epoch 3/200
182557/182557 [==============================] - 2s 12us/step - loss: 70531052.0493 - mean_squared_error: 70531052.0493 - acc: 2.1911e-05 - val_loss: 54933938.6787 - val_mean_squared_error: 54933938.6787 - val_acc: 3.2866e-05
Epoch 4/200
182557/182557 [==============================] - 2s 11us/step - loss: 42639802.3204 - mean_squared_error: 42639802.3204 - acc: 3.2866e-05 - val_loss: 32645940.6536 - val_mean_squared_error: 32645940.6536 - val_acc: 1.3146e-04
Epoch 5/200
182557/182557 [==============================] - 2s 11us/step - loss: 28282909.0699 - mean_squared_error: 28282909.0699 - acc: 1.4242e-04 - val_loss: 25315220.7446 - val_mean_squared_error: 25315220.7446 - val_acc: 9.8598e-05
Epoch 6/200
182557/182557 [==============================] - 2s 11us/step - loss: 24279169.5270 - mean_squared_error: 24279169.5270 - acc: 3.8344e-05 - val_loss: 23420569.2554 - val_mean_squared_error: 23420569.2554 - val_acc: 9.8598e-05
Epoch 7/200
182557/182557 [==============================] - 2s 11us/step - loss: 22874003.0459 - mean_squared_error: 22874003.0459 - acc: 9.8599e-05 - val_loss: 22380401.0622 - val_mean_squared_error: 22380401.0622 - val_acc: 1.6433e-05
...
Epoch 197/200
182557/182557 [==============================] - 2s 12us/step - loss: 13828827.1595 - mean_squared_error: 13828827.1595 - acc: 3.3414e-04 - val_loss: 14123447.1746 - val_mean_squared_error: 14123447.1746 - val_acc: 3.1223e-04

Epoch 00197: ReduceLROnPlateau reducing learning rate to 0.00020950120233464986.
Epoch 198/200
182557/182557 [==============================] - 2s 13us/step - loss: 13827193.5994 - mean_squared_error: 13827193.5994 - acc: 2.4102e-04 - val_loss: 14116898.8054 - val_mean_squared_error: 14116898.8054 - val_acc: 1.6433e-04

Epoch 00198: ReduceLROnPlateau reducing learning rate to 0.00019902614221791736.
Epoch 199/200
182557/182557 [==============================] - 2s 12us/step - loss: 13823582.4300 - mean_squared_error: 13823582.4300 - acc: 3.3962e-04 - val_loss: 14108715.5067 - val_mean_squared_error: 14108715.5067 - val_acc: 4.1083e-04
Epoch 200/200
182557/182557 [==============================] - 2s 11us/step - loss: 13820568.7721 - mean_squared_error: 13820568.7721 - acc: 3.1223e-04 - val_loss: 14106001.7681 - val_mean_squared_error: 14106001.7681 - val_acc: 2.3006e-04
60853/60853 [==============================] - 1s 18us/step
Model score: [14106001.790199332, 14106001.790199332, 0.00023006260989597883]
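
A note on reading the log above: the two ReduceLROnPlateau messages are consistent with the stated initial rate of 3e-4 having been multiplied by the factor 0.95 seven and then eight times. The sketch below only reproduces that arithmetic; the helper name lr_after is hypothetical, and it assumes no other mechanism changed the learning rate.

```python
initial_lr = 3e-4   # learning rate stated in the question
factor = 0.95       # ReduceLROnPlateau(factor = 0.95, ...)

def lr_after(reductions):
    """Learning rate after the given number of plateau reductions."""
    return initial_lr * factor ** reductions

for n in (7, 8):
    # Compare with the values printed after epochs 197 and 198 in the log.
    print(n, lr_after(n))
```

The small differences in the trailing digits compared with the log are what one would expect if the rate is stored as a 32-bit float.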

I am still a beginner in machine learning. Is there a major mistake in my approach? What am I doing wrong?

3 answers:

Answer 0: (score: 1)

Solution

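So, after a while, I found the Kaggle link to the right dataset. I was first using https://www.kaggle.com/vfsousas/autos, but the same data is also available at https://www.kaggle.com/orgesleka/used-cars-database, which comes with 222 kernels to look at. Looking at https://www.kaggle.com/themanchanda/neural-network-approach showed that this person also gets "big numbers" for the loss, which was the main source of my confusion (so far I had only dealt with "small numbers" or "accuracy"), and it made me think again.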

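Then it became clear to me:

  • The dataset was prepared correctly
  • The model was working correctly
  • I used the wrong metrics and compared them with different metrics from sklearn's LinearRegression, which are not comparable anyway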

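In short:

  • An MAE (mean absolute error) of around 2000 means that the predicted car price is off by 2,000 € on average (e.g. the correct price is 10,000 € and the model predicts 8,000 € or 12,000 €)
  • The MSE (mean squared error) is of course a much bigger number; that is to be expected and is not "garbage" or a sign of a broken model, as I first interpreted it
  • The "accuracy" metric is only meant for classification and is useless for regression
  • The default scoring function of sklearn's LinearRegression is the r2 score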
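
To make the difference in scale concrete, here is a minimal sketch in plain Python. The prices and predictions are made up for illustration and the helper names mae and mse are hypothetical; in practice sklearn.metrics.mean_absolute_error and mean_squared_error do the same job. The last lines convert the final test loss reported in the question into a root mean squared error, which is back in euros.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error, in the same unit as the target (here: euros)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error, in squared units (here: euros squared)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up prices and predictions, for illustration only.
y_true = [10000.0, 1500.0, 3600.0, 650.0]
y_pred = [8000.0, 2500.0, 6600.0, 2650.0]

print(mae(y_true, y_pred))             # absolute errors are 2000, 1000, 3000, 2000
print(mse(y_true, y_pred))             # squaring makes the number much larger
print(math.sqrt(mse(y_true, y_pred)))  # RMSE, back in euros

# Final test loss reported in the question, converted to an RMSE.
reported_mse = 14106001.790199332
reported_rmse = math.sqrt(reported_mse)
print(round(reported_rmse))
```

An MSE in the tens of millions therefore corresponds to a typical error of a few thousand euros, which is plausible for prices between 100 € and 100,000 €.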

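So I changed the metrics to "mae" and a custom r2 implementation, so that I can compare the results with LinearRegression.
It turns out that on the first try, after about 100 epochs, I ended up with an MAE of 1900 and an r2 score of 0.69.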
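
The custom r2 implementation itself is not shown in this answer. As a minimal sketch of the underlying formula only (not the answerer's code), the coefficient of determination can be computed in plain Python as below. The helper name r2 is hypothetical; a Keras metric would express the same arithmetic with tensor operations, and sklearn.metrics.r2_score computes the same quantity.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # total sum of squares
    return 1.0 - ss_res / ss_tot

# The five prices from the sample rows in the question.
y_true = [1500.0, 3600.0, 650.0, 2200.0, 2000.0]

perfect = list(y_true)                                   # predicts every price exactly
mean_only = [sum(y_true) / len(y_true)] * len(y_true)    # always predicts the mean price

print(r2(y_true, perfect))
print(r2(y_true, mean_only))
```

A perfect prediction scores 1 and always predicting the mean scores 0, so the score does not depend on the scale of the prices. That is why an r2 score can be compared directly with the LinearRegression score, while a raw MSE cannot.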

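Then, for comparison, I also calculated the MAE for LinearRegression, which came out at 2855.417 (with an r2 score of 0.67).

So in fact the deep learning approach was already better in terms of both MAE and r2 score. Nothing was wrong, and I can now go on to tune/optimize the model :)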

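Answer 1: (score: 0)

I have a few suggestions.

  1. Increase the number of neurons in the hidden layers.

  2. Try not to use relu; use tanh instead.

  3. Remove the dropout layers until the model starts working; then you can add them back and retrain.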

input_tensor = model_stack = Input(dataset.get_input_shape()) # (303, )
model_stack = Dense(128)(model_stack)
model_stack = Activation("tanh", name = "tanh_1")(model_stack)

model_stack = Dense(64)(model_stack)
model_stack = Activation("tanh", name = "tanh_2")(model_stack)

model_stack = Dense(1, name = "Output")(model_stack)

model = Model(inputs = [input_tensor], outputs = [model_stack])
model.compile(loss = "mse", optimizer = optimizer(lr = learning_rate), metrics = ['mse'])

model.summary()

Answer 2: (score: 0)

Your model seems to be underfitting.

Try adding more neurons as suggested already. 
And also try to increase the number of layers. 
Try using sigmoid as your activation function. 
Try increasing your learning rate. You can also switch between Adam and SGD as the optimizer.

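Always approach model fitting from scratch. Try changing one parameter at a time, then two together, and so on. I would also suggest looking for papers or existing work related to your dataset. That will give you a direction.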
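
The "one parameter at a time, then two together" procedure can be sketched as a small generator of configurations. The baseline values below come from the question (20 units, two hidden layers, relu, learning rate 3e-4). The alternatives are placeholders drawn loosely from the answers here, not recommendations, and the names baseline, alternatives and variations are hypothetical.

```python
from itertools import combinations, product

# Baseline from the question; alternatives are illustrative placeholders.
baseline = {"units": 20, "layers": 2, "activation": "relu", "learning_rate": 3e-4}
alternatives = {
    "units": [64, 128],
    "layers": [3],
    "activation": ["tanh", "sigmoid"],
    "learning_rate": [1e-3],
}

def variations(n_changed):
    """Yield configs that differ from the baseline in exactly n_changed parameters."""
    for names in combinations(alternatives, n_changed):
        for values in product(*(alternatives[name] for name in names)):
            config = dict(baseline)
            config.update(zip(names, values))
            yield config

one_at_a_time = list(variations(1))
two_at_a_time = list(variations(2))
print(len(one_at_a_time), len(two_at_a_time))
for config in one_at_a_time:
    print(config)
```

Each generated configuration would then be trained and scored with the same metric, so that a change in the score can be attributed to the parameters that were changed.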