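I am new to machine learning and am having trouble getting results from my dataset with the SVR algorithm. My data contains 1800 observations, and I have 66 features to predict them. A shortened version of my code follows: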
import numpy as np
from sklearn.svm import SVR
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.preprocessing import StandardScaler
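# df is assumed to be a pandas DataFrame that has already been loaded
y = df['Col1']
X = df[['Col2', 'Col4', 'Col8', 'Col9']]  # etc.: 66 feature columns in total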
# STANDARDIZE AND DIVIDE INTO TRAIN AND TEST DATA
scaler = StandardScaler()
X = scaler.fit_transform(X)
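# Use a separate scaler for the target; StandardScaler expects 2-D input
y_scaler = StandardScaler()
y = y_scaler.fit_transform(y.to_numpy().reshape(-1, 1)).ravel()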
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=1)
# FIND BEST PARAMETERS
C_range = 10.0 ** np.arange(-4, 4)
gamma_range = 10.0 ** np.arange(-4, 4)
param_grid = dict(gamma=gamma_range.tolist(), C=C_range.tolist())
svr = GridSearchCV(SVR(kernel='rbf', gamma=0.1),param_grid, cv=10)
svr.fit(X_train,y_train)
print(svr.best_score_)
print(svr.best_estimator_)
# FIT THE MODEL USING BEST ESTIMATOR
model = SVR(C=10.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=100.0,
            kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
model.fit(X_train,y_train)
results = model.predict(X_test)
After running this code, my model predicts the same value for every observation in the test dataset.
In: results
Out: array([ 0.02213646, 0.02213646, 0.02213646, 0.02213646, 0.02213646,
0.02213646, 0.02213646, 0.02213646, 0.02213646, 0.02213646 etc
Am I missing something in the code?
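On the code side, one thing that could be changed is to keep the scaling inside the cross-validation, so the scalers are fit only on the training folds, and to reuse the fitted search object instead of retyping the best parameters. A minimal sketch, assuming synthetic data in place of `df` and a much smaller parameter grid than the one above; the step name `svr` is the name `make_pipeline` generates automatically:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): 300 rows, 5 features.
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 5))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Features are scaled inside the pipeline; the target is scaled by the wrapper,
# which also maps predictions back to the original units of y.
regressor = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel='rbf')),
    transformer=StandardScaler(),
)

# Nested parameter names: wrapper parameter -> pipeline step -> SVR parameter.
param_grid = {
    'regressor__svr__C': [0.1, 1.0, 10.0],
    'regressor__svr__gamma': [0.01, 0.1, 1.0],
}
search = GridSearchCV(regressor, param_grid, cv=5)
search.fit(X_train, y_train)

# With the default refit=True, the search object predicts with the best model.
results = search.predict(X_test)
test_r2 = search.score(X_test, y_test)
print(search.best_params_)
print("R^2 on the test split:", test_r2)
```

Because the scalers live inside the estimator, each cross-validation fold fits them on its own training part, and nothing from the test split influences the standardization.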
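The problem seems to lie in the data I am trying to predict, because the code works fine with a different dataset or with random X and y. I would like to know whether there is any particular reason an ML algorithm might predict almost the same value for every observation in the test set when the true values differ widely.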
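One possible cause that may be worth ruling out is the kernel width. With an RBF kernel, k(x, x') = exp(-gamma * ||x - x'||^2), so with gamma=100.0 and 66 standardized features the kernel value between a test point and any training point can underflow to zero, in which case every prediction collapses to the model's intercept. The sketch below illustrates this on synthetic data; the data, the sizes, and the helper name `fit_and_predict` are assumptions for illustration, not the actual dataset:

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic stand-in for the real data (assumption): 66 standardized features.
rng = np.random.default_rng(0)
n_train, n_test, n_features = 300, 100, 66
X_train = rng.standard_normal((n_train, n_features))
X_test = rng.standard_normal((n_test, n_features))
weights = rng.standard_normal(n_features)
y_train = X_train @ weights / np.sqrt(n_features) + 0.1 * rng.standard_normal(n_train)


def fit_and_predict(gamma):
    """Hypothetical helper: fit an RBF SVR and return (model, test predictions)."""
    model = SVR(kernel='rbf', C=10.0, epsilon=0.1, gamma=gamma)
    model.fit(X_train, y_train)
    return model, model.predict(X_test)


# gamma=100.0 as in the question; 1 / n_features as a much wider kernel.
model_narrow, pred_narrow = fit_and_predict(100.0)
model_wide, pred_wide = fit_and_predict(1.0 / n_features)

spread_narrow = pred_narrow.max() - pred_narrow.min()
spread_wide = pred_wide.max() - pred_wide.min()

print("gamma=100.0  spread of predictions:", spread_narrow)
print("gamma=100.0  intercept:", model_narrow.intercept_[0])
print("gamma=1/66   spread of predictions:", spread_wide)
```

If the same check on the real data shows that `results` equals `model.intercept_`, the constant output would come from the chosen gamma rather than from the data as such.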