Question

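I am new to machine learning with Python. I am trying to predict a quantity, say house price, and I am using higher-order polynomial features to build the model. I have two datasets, and I have fitted the model on one of them. How do I apply this model to a completely new dataset? My code is attached below: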

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

data1 = pd.read_csv(r"C:\Users\DELL\Desktop\experimental data/xyz1.csv", engine='c', dtype=float, delimiter=",")
data2 = pd.read_csv(r"C:\Users\DELL\Desktop\experimental data/xyz2.csv", engine='c', dtype=float, delimiter=",")

# I have to do this step; otherwise I get a NaN or infinite value error every time
data1.fillna(0.000, inplace=True)
data2.fillna(0.000, inplace=True)

X_train = data1.drop('result', axis = 1)
y_train = data1.result
X_test = data2.drop('result', axis = 1)
y_test = data2.result

x2_ = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_train)
x3_ = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X_train)

model2 = LinearRegression().fit(x2_, y_train)
model3 = LinearRegression().fit(x3_, y_train)

r_sq2 = model2.score(x2_, y_train)
r_sq3 = model3.score(x3_, y_train)

y_pred2 = model2.predict(x2_)
y_pred3 = model3.predict(x3_)

So basically I am stuck after this point. How do I apply the same model to the test data to predict the y_test values and compute the score?

Answer 1

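To reproduce the effect of PolynomialFeatures, you need to store the object itself (once for degree=2 and again for degree=3). Otherwise, you have no way to apply the fitted transform to the test dataset.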

X_train = data1.drop('result', axis = 1)
y_train = data1.result
X_test = data2.drop('result', axis = 1)
y_test = data2.result

# store these data transform objects
pf2 = PolynomialFeatures(degree=2, include_bias=False)
pf3 = PolynomialFeatures(degree=3, include_bias=False)

# then apply the transform to the training set
x2_ = pf2.fit_transform(X_train)
x3_ = pf3.fit_transform(X_train)

model2 = LinearRegression().fit(x2_, y_train)
model3 = LinearRegression().fit(x3_, y_train)

r_sq2 = model2.score(x2_, y_train)
r_sq3 = model3.score(x3_, y_train)

y_pred2 = model2.predict(x2_)
y_pred3 = model3.predict(x3_)

# now apply the fitted transform to the test set
x2_test = pf2.transform(X_test)
x3_test = pf3.transform(X_test)

# apply trained model to transformed test data
y2_test_pred = model2.predict(x2_test)
y3_test_pred = model3.predict(x3_test)

# compute the R^2 score of each model on the test data
r_sq2_test = model2.score(x2_test, y_test)
r_sq3_test = model3.score(x3_test, y_test)
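
As an alternative sketch (not part of the answer above), scikit-learn's make_pipeline can bundle the PolynomialFeatures step and the LinearRegression step into one object, so the fitted transform is stored and reapplied automatically. The synthetic arrays and the `target` helper below are assumptions that stand in for the two CSV files, which are not available here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-ins for the two CSV files (assumption: not the asker's data).
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(200, 2))
X_test = rng.uniform(-1, 1, size=(50, 2))


def target(X):
    # Hypothetical helper: an exact degree-2 polynomial of the two features.
    return 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 0] * X[:, 1] + 0.5 * X[:, 1] ** 2


y_train = target(X_train)
y_test = target(X_test)

# The pipeline keeps the fitted PolynomialFeatures together with the regression.
pipe2 = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
pipe2.fit(X_train, y_train)

# predict() and score() apply the stored transform to the raw test features.
y2_test_pred = pipe2.predict(X_test)
r_sq2_test = pipe2.score(X_test, y_test)
```

With this layout there is no separate x2_test array to keep track of: the raw X_test goes straight into predict() and score(). A degree=3 model would be a second pipeline built the same way.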
