Question

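Basically, I am deploying a proof of concept on a linear regression model to check its accuracy (coefficient of determination) on a specific dataset. At a high level, before building the model I applied some preprocessing to the dataset to make sure that every column needed as input is numeric and valid.

The dataset overview shows that all columns are numeric and correctly formatted. Predictors:

Target:

I ran describe to get more detail and check the values again (predictors in red, target in yellow).

Deploying the model: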
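The checks described above can be sketched roughly as follows. This is only an illustration: the DataFrame `df` and its column names (`feature_a`, `feature_b`, `target`) are hypothetical stand-ins, not the asker's actual data.

```python
import pandas as pd

# Hypothetical small frame standing in for the real dataset (assumption).
df = pd.DataFrame({
    "feature_a": [1.0, 2.5, 3.2, 4.8],
    "feature_b": [10, 20, 30, 40],
    "target": [1.1, 2.3, 3.1, 4.9],
})

# Columns whose dtype is not numeric; an empty list means every column is numeric.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()

# Number of missing values per column.
missing_per_column = df.isna().sum()

# Summary statistics (count, mean, std, min, quartiles, max) per column.
summary = df.describe()

print(non_numeric)
print(missing_per_column)
print(summary)
```

An empty `non_numeric` list and all-zero missing counts only confirm that the columns are usable as input; they say nothing about whether a linear model fits the data well.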

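from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split into training and test sets
# (note: test_size=0.80 keeps only 20% of the rows for training)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.80, random_state=33)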

# Apply the scaler
scalerX = StandardScaler().fit(X_train)
scalery = StandardScaler().fit(y_train.reshape(-1,1))
X_train = scalerX.transform(X_train)
y_train = scalery.transform(y_train.reshape(-1,1))

# Apply the scalers fitted on the training data to the test set
X_test = scalerX.transform(X_test)
y_test = scalery.transform(y_test.reshape(-1,1))

# Create the linear regression model
# (loss='squared_loss' was renamed 'squared_error' in scikit-learn 1.0)
clf_sgd = linear_model.SGDRegressor(loss='squared_loss',penalty=None,random_state=33)
#clf_sgd = LinearRegression()

# Fit the model on the training data
clf_sgd.fit(X_train,y_train.ravel())
print("Coefficient de determination:",clf_sgd.score(X_train,y_train))
# Model performance
y_pred = clf_sgd.predict(X_test)
print("Coefficient de determination:{0:.3f}".format(metrics.r2_score(y_test,y_pred)))

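Unfortunately, my results are very, very bad.

I would like to hear ideas on how to improve my model. I do not have much experience with this, so any help is welcome. Thank you very much.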

Answer 1

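There are two things you can improve:

1) You need to configure the hyperparameters of the linear model properly. scikit-learn's SGDRegressor is very sensitive to the values chosen for several parameters, most importantly alpha, penalty, loss and max_iter. Read up on a technique called cross-validation and use it on your data to determine reasonable values for these parameters.

2) Except in very specific cases, you do not actually need to scale the target variable y.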
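The two points above can be sketched together as follows. This is a minimal illustration rather than a tuned solution: the data comes from `make_regression` as a stand-in for the real dataset, the grid values are arbitrary starting points, and the 80/20 split is an assumption. Only the features are scaled (inside a pipeline, so that each cross-validation fold fits its own scaler), and y is left in its original units. The loss parameter is left at its default here because the accepted name for squared loss differs between scikit-learn versions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumption: numeric X, continuous y).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=33)

# Keep 80% of the rows for training and 20% for the final test (assumed split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=33
)

# Scale the features only; the target y is not scaled.
pipeline = make_pipeline(StandardScaler(), SGDRegressor(random_state=33))

# Candidate hyperparameter values (illustrative, not recommendations).
param_grid = {
    "sgdregressor__alpha": [1e-5, 1e-4, 1e-3, 1e-2],
    "sgdregressor__penalty": ["l2", "l1", "elasticnet"],
    "sgdregressor__max_iter": [1000, 5000],
}

# 5-fold cross-validation over the grid, scored by R^2.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

best_params = search.best_params_
test_r2 = search.score(X_test, y_test)  # R^2 of the refitted best model

print("Best parameters:", best_params)
print("Cross-validated R^2: {0:.3f}".format(search.best_score_))
print("Test R^2: {0:.3f}".format(test_r2))
```

Because the scaler lives inside the pipeline, `search.predict` returns predictions directly in the original units of y, so no inverse transform is needed.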

Improving a linear regression POC with sklearn and pandas
