I am trying to train a model to predict car prices. The dataset is from kaggle: https://www.kaggle.com/vfsousas/autos#autos.csv
I am preparing the data with the following code:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

class CarDataset(DataSet):  # DataSet base class defined elsewhere in my code
    def __init__(self, csv_file):
        df = pd.read_csv(csv_file).drop(["dateCrawled", "name", "abtest", "dateCreated",
                                         "nrOfPictures", "postalCode", "lastSeen"], axis=1)
        # drop commercial sellers and "wanted" ads, then the now-constant columns
        df = df.drop(df[df["seller"] == "gewerblich"].index).drop(["seller"], axis=1)
        df = df.drop(df[df["offerType"] == "Gesuch"].index).drop(["offerType"], axis=1)
        # drop rows with missing values in the remaining categorical columns
        df = df[df["vehicleType"].notnull()]
        df = df[df["notRepairedDamage"].notnull()]
        df = df[df["model"].notnull()]
        df = df[df["fuelType"].notnull()]
        # clip implausible outliers
        df = df[(df["price"] > 100) & (df["price"] < 100000)]
        df = df[(df["monthOfRegistration"] > 0) & (df["monthOfRegistration"] < 13)]
        df = df[(df["yearOfRegistration"] < 2019) & (df["yearOfRegistration"] > 1950)]
        df = df[(df["powerPS"] > 20) & (df["powerPS"] < 550)]
        # binary-encode the two-valued categorical columns
        df["hasDamage"] = np.where(df["notRepairedDamage"] == "ja", 1, 0)
        df["automatic"] = np.where(df["gearbox"] == "manuell", 1, 0)
        df["fuel"] = np.where(df["fuelType"] == "benzin", 0, 1)
        # age of the car in months
        df["age"] = (2019 - df["yearOfRegistration"]) * 12 + df["monthOfRegistration"]
        df = df.drop(["notRepairedDamage", "gearbox", "fuelType",
                      "yearOfRegistration", "monthOfRegistration"], axis=1)
        # one-hot encode the remaining categorical columns
        df = pd.get_dummies(df, columns=["vehicleType", "model", "brand"])

        self.df = df
        self.Y = self.df["price"].values
        self.X = self.df.drop(["price"], axis=1).values

        scaler = StandardScaler()
        scaler.fit(self.X)
        self.X = scaler.transform(self.X)

        self.x_train, self.x_test, self.y_train, self.y_test = train_test_split(
            self.X, self.Y, test_size=0.25, random_state=0)
        self.x_train, self.x_valid, self.y_train, self.y_valid = train_test_split(
            self.x_train, self.y_train, test_size=0.25, random_state=0)

    def get_input_shape(self):
        return (len(self.df.columns) - 1, )  # (303, )
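One aside on the preparation code above: the StandardScaler is fitted on the full X before the train/test split, so test-set statistics leak into the scaling. A common refinement (not part of the original post) is to fit the scaler on the training split only. A minimal sketch with made-up toy data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# toy stand-in for the prepared feature matrix
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = rng.rand(100)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)  # statistics come from the training set only...
x_test = scaler.transform(x_test)        # ...and are merely applied to the test set
```

This keeps the evaluation honest, though with a dataset this large the practical difference is usually small.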
This produces the following prepared dataset:
price powerPS kilometer hasDamage automatic fuel age vehicleType_andere vehicleType_bus vehicleType_cabrio vehicleType_coupe ... brand_rover brand_saab brand_seat brand_skoda brand_smart brand_subaru brand_suzuki brand_toyota brand_trabant brand_volkswagen brand_volvo
3 1500 75 150000 0 1 0 222 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1 0
4 3600 69 90000 0 1 1 139 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0 0
5 650 102 150000 1 1 0 298 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0
6 2200 109 150000 0 1 0 188 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0 0
10 2000 105 150000 0 1 0 192 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0
[5 rows x 304 columns]
hasDamage is a flag (0 or 1) indicating whether the car has unrepaired damage
automatic is a flag (0 or 1) indicating whether the car has a manual or automatic gearbox
fuel is 0 for diesel and 1 for gasoline
age is the age of the car in months

The brand, model and vehicleType columns are one-hot encoded using df = pd.get_dummies(df, columns = ["vehicleType", "model", "brand"]).
Additionally, I transform the X values with a StandardScaler.
The dataset now contains 303 columns for X and, of course, the "price" column for Y.
Using this dataset, a plain LinearRegression achieves a score of about 0.7 on both the training and test sets.
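The ~0.7 figure is what LinearRegression.score() returns, which is the r2 score. A minimal sketch of such a baseline, with toy data standing in for the prepared car dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# toy data: a noisy linear relationship, standing in for the real features
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([3.0, -2.0, 1.0]) + 0.1 * rng.rand(200)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

reg = LinearRegression().fit(x_train, y_train)
train_r2 = reg.score(x_train, y_train)  # .score() returns the r2 score
test_r2 = reg.score(x_test, y_test)
```

Keeping in mind which metric this is matters later when comparing against the Keras model's mse loss.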
Now I have tried a deep learning approach with keras, but no matter what I do, the mse loss stays enormous and the model does not seem to learn anything:
input_tensor = model_stack = Input(dataset.get_input_shape())  # (303, )
model_stack = Dense(20)(model_stack)
model_stack = Activation("relu", name="relu_1")(model_stack)
model_stack = Dense(20)(model_stack)
model_stack = Activation("relu", name="relu_2")(model_stack)
model_stack = Dense(1, name="Output")(model_stack)

model = Model(inputs=[input_tensor], outputs=[model_stack])
model.compile(loss="mse", optimizer=optimizer(lr=learning_rate), metrics=['mse'])
model.summary()

callbacks = []
callbacks.append(ReduceLROnPlateau(monitor="val_loss", factor=0.95, verbose=self.verbose, patience=1))
callbacks.append(EarlyStopping(monitor='val_loss', patience=5, min_delta=0.01,
                               restore_best_weights=True, verbose=self.verbose))

model.fit(x=dataset.x_train,
          y=dataset.y_train,
          verbose=1,
          batch_size=128,
          epochs=200,
          validation_data=(dataset.x_valid, dataset.y_valid),
          callbacks=callbacks)

score = model.evaluate(dataset.x_test, dataset.y_test, verbose=1)
print("Model score: {}".format(score))
The summary / training looks like this (with a learning rate of 3e-4):
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 6) 0
_________________________________________________________________
dense_1 (Dense) (None, 20) 140
_________________________________________________________________
relu_1 (Activation) (None, 20) 0
_________________________________________________________________
dense_2 (Dense) (None, 20) 420
_________________________________________________________________
relu_2 (Activation) (None, 20) 0
_________________________________________________________________
Output (Dense) (None, 1) 21
=================================================================
Total params: 581
Trainable params: 581
Non-trainable params: 0
_________________________________________________________________
Train on 182557 samples, validate on 60853 samples
Epoch 1/200
182557/182557 [==============================] - 2s 13us/step - loss: 110046953.4602 - mean_squared_error: 110046953.4602 - acc: 0.0000e+00 - val_loss: 107416331.4062 - val_mean_squared_error: 107416331.4062 - val_acc: 0.0000e+00
Epoch 2/200
182557/182557 [==============================] - 2s 11us/step - loss: 97859920.3050 - mean_squared_error: 97859920.3050 - acc: 0.0000e+00 - val_loss: 85956634.8803 - val_mean_squared_error: 85956634.8803 - val_acc: 1.6433e-05
Epoch 3/200
182557/182557 [==============================] - 2s 12us/step - loss: 70531052.0493 - mean_squared_error: 70531052.0493 - acc: 2.1911e-05 - val_loss: 54933938.6787 - val_mean_squared_error: 54933938.6787 - val_acc: 3.2866e-05
Epoch 4/200
182557/182557 [==============================] - 2s 11us/step - loss: 42639802.3204 - mean_squared_error: 42639802.3204 - acc: 3.2866e-05 - val_loss: 32645940.6536 - val_mean_squared_error: 32645940.6536 - val_acc: 1.3146e-04
Epoch 5/200
182557/182557 [==============================] - 2s 11us/step - loss: 28282909.0699 - mean_squared_error: 28282909.0699 - acc: 1.4242e-04 - val_loss: 25315220.7446 - val_mean_squared_error: 25315220.7446 - val_acc: 9.8598e-05
Epoch 6/200
182557/182557 [==============================] - 2s 11us/step - loss: 24279169.5270 - mean_squared_error: 24279169.5270 - acc: 3.8344e-05 - val_loss: 23420569.2554 - val_mean_squared_error: 23420569.2554 - val_acc: 9.8598e-05
Epoch 7/200
182557/182557 [==============================] - 2s 11us/step - loss: 22874003.0459 - mean_squared_error: 22874003.0459 - acc: 9.8599e-05 - val_loss: 22380401.0622 - val_mean_squared_error: 22380401.0622 - val_acc: 1.6433e-05
...
Epoch 197/200
182557/182557 [==============================] - 2s 12us/step - loss: 13828827.1595 - mean_squared_error: 13828827.1595 - acc: 3.3414e-04 - val_loss: 14123447.1746 - val_mean_squared_error: 14123447.1746 - val_acc: 3.1223e-04
Epoch 00197: ReduceLROnPlateau reducing learning rate to 0.00020950120233464986.
Epoch 198/200
182557/182557 [==============================] - 2s 13us/step - loss: 13827193.5994 - mean_squared_error: 13827193.5994 - acc: 2.4102e-04 - val_loss: 14116898.8054 - val_mean_squared_error: 14116898.8054 - val_acc: 1.6433e-04
Epoch 00198: ReduceLROnPlateau reducing learning rate to 0.00019902614221791736.
Epoch 199/200
182557/182557 [==============================] - 2s 12us/step - loss: 13823582.4300 - mean_squared_error: 13823582.4300 - acc: 3.3962e-04 - val_loss: 14108715.5067 - val_mean_squared_error: 14108715.5067 - val_acc: 4.1083e-04
Epoch 200/200
182557/182557 [==============================] - 2s 11us/step - loss: 13820568.7721 - mean_squared_error: 13820568.7721 - acc: 3.1223e-04 - val_loss: 14106001.7681 - val_mean_squared_error: 14106001.7681 - val_acc: 2.3006e-04
60853/60853 [==============================] - 1s 18us/step
Model score: [14106001.790199332, 14106001.790199332, 0.00023006260989597883]
I am still a beginner at machine learning. Is there any major mistake in my approach? What am I doing wrong?
Answer 0 (score: 1):
So, after a while I found the kaggle link to the right dataset. I had first been using https://www.kaggle.com/vfsousas/autos, but the same data is also here: https://www.kaggle.com/orgesleka/used-cars-database, together with 222 kernels to look at. A look at https://www.kaggle.com/themanchanda/neural-network-approach then showed that this person was also getting "big numbers" for the loss, which was the main source of my confusion (since so far I had only dealt with "small numbers" and accuracies) and made me rethink.

Then it became clear to me: I had been comparing the loss of my model to other metrics of sklearn's LinearRegression, which are not comparable anyway. In short: the default scoring function of sklearn's LinearRegression is the r2 score.

So I changed my metrics to "mae" plus a custom r2 implementation, so that I could compare them to the LinearRegression.
It turned out that on the first try, after about 100 epochs, I got an MAE of 1900 and an r2 score of 0.69.
Then, for comparison, I also computed the MAE of the LinearRegression, which evaluated to 2855.417 (with an r2 score of 0.67).
So in fact the deep learning approach was already better in terms of both MAE and r2 score. Nothing was wrong after all, and I can now go on tuning/optimizing the model :)
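The "custom r2 implementation" mentioned above is not shown in the answer; as a hedged sketch, the r2 formula in plain numpy looks like the following (the same arithmetic can be wrapped in Keras backend ops to use it as a training metric):

```python
import numpy as np

def r2(y_true, y_pred):
    """r2 score: 1 minus the ratio of residual to total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
score = r2(y_true, y_pred)  # ≈ 0.9486
```

Unlike mse, this score is scale-free, which is why it lets the Keras model and the LinearRegression be compared directly.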
Answer 1 (score: 0):
A few suggestions:
Increase the number of neurons in the hidden layers.
Try not to use relu; use tanh instead.
Remove the dropout layers until the model starts working; after that you can add them back and retrain.
input_tensor = model_stack = Input(dataset.get_input_shape())  # (303, )
model_stack = Dense(128)(model_stack)
model_stack = Activation("tanh", name="tanh_1")(model_stack)
model_stack = Dense(64)(model_stack)
model_stack = Activation("tanh", name="tanh_2")(model_stack)
model_stack = Dense(1, name="Output")(model_stack)

model = Model(inputs=[input_tensor], outputs=[model_stack])
model.compile(loss="mse", optimizer=optimizer(lr=learning_rate), metrics=['mse'])
model.summary()
Answer 2 (score: 0):
Your model seems to be underfitting.
Try adding more neurons, as suggested already.
And also try to increase the number of layers.
Try using sigmoid as your activation function.
Try increasing your learning rate. You can switch between Adam and SGD as well.
Always start model fitting from scratch, and try changing one parameter at a time. Then change two together, and so on. In addition, I suggest you look for papers or existing work related to your dataset; that will give you a direction.