I'm running into a problem with this NN regression model in Keras. I'm working on a car dataset, predicting the price from 13 dimensions. In short, I've read it into a pandas DataFrame, converted the numeric values to float, scaled the values, and then used one-hot encoding for the categorical values, which creates many new columns, but that is not what concerns me. What worries me is that the accuracy is almost 0%, and I can't figure out why. The dataset is available here: https://www.kaggle.com/CooperUnion/cardataset/data. Here is the code:
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from keras.utils import to_categorical
# load dataset
# Columns : Make, Model, Year, Engine Fuel Type, Engine HP, Engine Cylinders, Transmission Type, Driven_Wheels, Number of Doors, Vehicle Size, Vehicle Style, highway MPG, city mpg, Popularity, MSRP
import pandas as pd
dataframe = pd.read_csv("cars.csv", header = 'infer', names=['Make', 'Model', 'Year', 'Engine Fuel Type', 'Engine HP', 'Engine Cylinders', 'Transmission Type', 'Driven_Wheels', 'Number of Doors', 'Vehicle Size', 'Vehicle Style', 'highway MPG', 'city mpg', 'Popularity', 'MSRP'])
#convert data columns to float
dataframe[['Engine HP', 'highway MPG', 'city mpg', 'Popularity', 'MSRP']] = dataframe[['Engine HP', 'highway MPG', 'city mpg', 'Popularity', 'MSRP']].apply(pd.to_numeric)
#normalize the values - divide by their max value
dataframe["Engine HP"] = dataframe["Engine HP"] / dataframe["Engine HP"].max()
dataframe["highway MPG"] = dataframe["highway MPG"] / dataframe["highway MPG"].max()
dataframe["city mpg"] = dataframe["city mpg"] / dataframe["city mpg"].max()
dataframe["Popularity"] = dataframe["Popularity"] / dataframe["Popularity"].max()
dataframe["MSRP"] = dataframe["MSRP"] / dataframe["MSRP"].max()
#split input and label
x = dataframe.iloc[:,0:14]
y = dataframe.iloc[:,14]
#one-hot encoding for categorical values
def one_hot(df, cols):
    for each in cols:
        dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
        df = pd.concat([df, dummies], axis=1)
    return df
#columns to transform
cols_to_tran = ['Make', 'Model', 'Year', 'Engine Fuel Type', 'Engine Cylinders', 'Transmission Type', 'Driven_Wheels', 'Number of Doors', 'Vehicle Size', 'Vehicle Style']
d = one_hot(x, cols_to_tran)
list(d.columns.values)
#drop first original 11 columns
e = d.drop(d.columns[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]], axis=1)
list(e.columns.values)
#create train and test datasets - 80% for train and 20% for validation
t = len(e)*0.8
t = int(t)
train_data = e[0:t]
train_targets = y[0:t]
test_data = e[t:]
test_targets = y[t:]
#convert to numpy array
train_data = train_data.values
train_targets = train_targets.values
test_data = test_data.values
test_targets = test_targets.values
# Sample Multilayer Perceptron Neural Network in Keras
from keras.models import Sequential
from keras.layers import Dense
import numpy
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(train_data.shape[1],)))
model.add(Dense(32, activation='relu'))
#model.add(Dense(1, activation='sigmoid'))
model.add(Dense(1))
# 2. compile the network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# 3. fit the network
history = model.fit(train_data, train_targets, epochs=100, batch_size=50)
# 4. evaluate the network
loss, accuracy = model.evaluate(test_data, test_targets)
print("\nLoss: %.2f, Accuracy: %.2f%%" % (loss, accuracy*100))
# 5. make predictions
probabilities = model.predict(test_data)
predictions = [float(x) for x in probabilities]
accuracy = numpy.mean(predictions == test_targets)
print("Prediction Accuracy: %.2f%%" % (accuracy*100))
Here are the results:

Any help would be greatly appreciated.
Answer 0 (score: 3)
Accuracy is a classification metric; it makes no sense to use it for regression. There is no actual problem here.
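For regression, an error metric such as mean absolute error is a more meaningful way to judge the model. A minimal sketch of how that could be checked on the question's test split (the use of sklearn.metrics here is an assumption, not part of the original code):

from sklearn.metrics import mean_absolute_error

# predict the (max-scaled) MSRP on the held-out rows and measure the average absolute error
predictions = model.predict(test_data).flatten()
mae = mean_absolute_error(test_targets, predictions)
print("Test MAE on the scaled target: %.4f" % mae)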
Answer 1 (score: 0)
First of all, you should consider cleaning up your code when posting a question on Stack Overflow. I tried to copy the code and found some errors, then cleaned up the dataset held in the numpy arrays train_data, train_targets, test_data and test_targets.

Focusing on machine learning theory: if you don't shuffle your dataset, it is hard for the regression model to get trained. Try shuffling the dataset with random.shuffle() before splitting it into the train and test subsets.
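For example, one way to shuffle the encoded features and the targets together before the 80/20 split (this reuses the e and y variables from the question; the numpy permutation approach is just one option, not necessarily the exact call meant above):

import numpy as np

# draw one random permutation of the row indices and apply it to features and targets alike,
# so each row keeps its matching MSRP target
perm = np.random.permutation(len(e))
e_shuffled = e.iloc[perm].reset_index(drop=True)
y_shuffled = y.iloc[perm].reset_index(drop=True)

# same 80/20 split as in the question, now on shuffled rows
t = int(len(e_shuffled) * 0.8)
train_data, test_data = e_shuffled.iloc[:t].values, e_shuffled.iloc[t:].values
train_targets, test_targets = y_shuffled.iloc[:t].values, y_shuffled.iloc[t:].values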
As stated in Matias' answer, it makes no sense to use the accuracy metric if you are dealing with a regression problem (rather than a classification one).
Also, the binary cross-entropy loss only applies to classification as well, so it does not make sense here either. The typical loss used for regression models is mean squared error. Consider changing your Keras model compilation:
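A minimal sketch of such a compile call (mean squared error as the loss; reporting mean absolute error instead of accuracy is an assumption, one reasonable choice of regression metric):

# regression setup: MSE loss, MAE reported during training instead of accuracy
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])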
Hope this helps!