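I am trying to build a support vector regression (SVR). I generate the data from a sinc function with some Gaussian noise added.
Now, to find the best parameters for the RBF kernel, I am using GridSearchCV with 5-fold cross-validation.
P.S. I am new to Python and machine learning, so the code may not be very optimized or correct in places.
My code: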
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

def generateData(N, sigmaT):
    # Input datapoints
    data = np.reshape(np.linspace(-10, 10, N), (N, 1))
    # Noise in target with zero mean and standard deviation sigmaT
    epi = np.random.normal(0, sigmaT, N)
    # Target
    t1 = np.sinc(data).ravel()        # target without noise
    t2 = np.sinc(data).ravel() + epi  # target with noise
    t1 = np.reshape(t1, (N, 1))
    t2 = np.reshape(t2, (N, 1))
    # Plot the generated data
    plt.plot(data, t1, '--r', label='Original Curve')
    plt.scatter(data, t2, c='orange', label='Data')
    plt.title("Generated data")
    return data, t2, t1
# Generate data from the sinc function
N = 100       # Number of data points
sigmaT = 0.1  # Noise in the data
plt.figure(1)
X, y, true = generateData(N, sigmaT)
y = y.ravel()
# Tuning of parameters for regression by cross-validation
K = 5  # Number of cross-validation folds
# Parameters for tuning
parameters = [{'kernel': ['rbf'],
               'gamma': [1e-4, 1e-3, 0.01, 0.1, 0.2, 0.5, 0.6, 0.9],
               'C': [1, 10, 100, 1000, 10000]}]
print("Tuning hyper-parameters")
svr = GridSearchCV(SVR(epsilon=0.01), parameters, cv=K)
svr.fit(X, y)
# Report the best parameters (this line produces the first line of the output below)
print("Best parameters set found on development set:", svr.best_params_)
# Checking the score for all parameters
print("Grid scores on training set:")
means = svr.cv_results_['mean_test_score']
stds = svr.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, svr.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
The result is:
Best parameters set found on development set: {'gamma': 0.0001, 'kernel': 'rbf', 'C': 1}
Grid scores on training set:
-0.240 (+/-0.366) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 1}
-0.535 (+/-1.076) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 1}
-0.863 (+/-1.379) for {'gamma': 0.01, 'kernel': 'rbf', 'C': 1}
-3.057 (+/-4.954) for {'gamma': 0.1, 'kernel': 'rbf', 'C': 1}
-1.576 (+/-3.185) for {'gamma': 0.2, 'kernel': 'rbf', 'C': 1}
-0.439 (+/-0.048) for {'gamma': 0.5, 'kernel': 'rbf', 'C': 1}
-0.417 (+/-0.110) for {'gamma': 0.6, 'kernel': 'rbf', 'C': 1}
-0.370 (+/-0.248) for {'gamma': 0.9, 'kernel': 'rbf', 'C': 1}
-0.514 (+/-0.724) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 10}
-1.308 (+/-3.002) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 10}
-4.717 (+/-10.886) for {'gamma': 0.01, 'kernel': 'rbf', 'C': 10}
-14.247 (+/-27.218) for {'gamma': 0.1, 'kernel': 'rbf', 'C': 10}
-15.241 (+/-19.086) for {'gamma': 0.2, 'kernel': 'rbf', 'C': 10}
-0.533 (+/-0.571) for {'gamma': 0.5, 'kernel': 'rbf', 'C': 10}
-0.566 (+/-0.527) for {'gamma': 0.6, 'kernel': 'rbf', 'C': 10}
-1.087 (+/-1.828) for {'gamma': 0.9, 'kernel': 'rbf', 'C': 10}
-0.591 (+/-1.218) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 100}
-2.111 (+/-2.940) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 100}
-19.591 (+/-29.731) for {'gamma': 0.01, 'kernel': 'rbf', 'C': 100}
-96.461 (+/-96.744) for {'gamma': 0.1, 'kernel': 'rbf', 'C': 100}
-14.430 (+/-10.858) for {'gamma': 0.2, 'kernel': 'rbf', 'C': 100}
-14.742 (+/-37.705) for {'gamma': 0.5, 'kernel': 'rbf', 'C': 100}
-7.915 (+/-10.308) for {'gamma': 0.6, 'kernel': 'rbf', 'C': 100}
-1.592 (+/-1.513) for {'gamma': 0.9, 'kernel': 'rbf', 'C': 100}
-1.543 (+/-3.654) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 1000}
-4.629 (+/-10.477) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 1000}
-65.690 (+/-92.825) for {'gamma': 0.01, 'kernel': 'rbf', 'C': 1000}
-2745.336 (+/-4173.978) for {'gamma': 0.1, 'kernel': 'rbf', 'C': 1000}
-248.269 (+/-312.776) for {'gamma': 0.2, 'kernel': 'rbf', 'C': 1000}
-65.826 (+/-132.946) for {'gamma': 0.5, 'kernel': 'rbf', 'C': 1000}
-28.569 (+/-64.979) for {'gamma': 0.6, 'kernel': 'rbf', 'C': 1000}
-6.955 (+/-8.647) for {'gamma': 0.9, 'kernel': 'rbf', 'C': 1000}
-3.647 (+/-7.858) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 10000}
-12.712 (+/-29.380) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 10000}
-1094.270 (+/-2262.303) for {'gamma': 0.01, 'kernel': 'rbf', 'C': 10000}
-3698.268 (+/-8085.389) for {'gamma': 0.1, 'kernel': 'rbf', 'C': 10000}
-2079.620 (+/-3651.872) for {'gamma': 0.2, 'kernel': 'rbf', 'C': 10000}
-70.982 (+/-159.707) for {'gamma': 0.5, 'kernel': 'rbf', 'C': 10000}
-89.859 (+/-180.071) for {'gamma': 0.6, 'kernel': 'rbf', 'C': 10000}
-661.291 (+/-1636.522) for {'gamma': 0.9, 'kernel': 'rbf', 'C': 10000}
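GridSearchCV reports the best parameters as C: 1, gamma: 0.0001, but when I checked by hand, the parameters should be C: 1000, gamma: 0.5.
Now my question is
Edit: I am also adding the code showing how I found the correct parameters. I simply passed the parameters directly to SVR and computed the mean squared error.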
# Working parameters
svr = SVR(kernel='rbf', C=1e3, gamma = 0.5, epsilon = 0.01)
y_rbf = svr.fit(X, y).predict(X)
# Plotting
plt.figure(1)
plt.plot(X, y_rbf, c = 'navy', label = 'Predicted')
plt.legend()
# Checking prediction error
print("Mean squared error: %.2f" % mean_squared_error(true, y_rbf))
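The plot for the parameters above is at this link: https://imgur.com/a/cmwPz
The plot for the parameters chosen by GridSearchCV: https://imgur.com/a/R1OAs
Answer 0 (score: 1)
A few things play an important role here:
1) The scoring criterion GridSearchCV uses to find the best parameters. Since you did not pass anything for GridSearchCV's scoring parameter, it falls back to SVR's own score method, which is the R-squared value, not mean_squared_error as in your manual check.
This can be fixed by doing the following: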
from sklearn.metrics import make_scorer
scorer = make_scorer(mean_squared_error, greater_is_better=False)
svr_gs = GridSearchCV(SVR(epsilon = 0.01), parameters, cv = K, scoring=scorer)
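To make point 1 concrete, here is a minimal sketch. It uses my own toy data modeled on the question, with an arbitrary seed, so it is not the asker's exact run. It shows that SVR's default score is R-squared, while the custom scorer returns the negated mean squared error:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import make_scorer, mean_squared_error, r2_score

# Hypothetical toy data in the spirit of the question (noisy sinc); the seed is arbitrary.
rng = np.random.default_rng(0)
X = np.linspace(-10, 10, 100).reshape(-1, 1)
y = np.sinc(X).ravel() + rng.normal(0, 0.1, 100)

model = SVR(kernel='rbf', C=1e3, gamma=0.5, epsilon=0.01).fit(X, y)
pred = model.predict(X)

# SVR.score returns R^2; this is what GridSearchCV uses when scoring is not set.
default_score = model.score(X, y)
r2 = r2_score(y, pred)

# greater_is_better=False negates the metric, so a larger score still means a better fit.
scorer = make_scorer(mean_squared_error, greater_is_better=False)
neg_mse = scorer(model, X, y)

print("R^2 (default score):", default_score)
print("Negated MSE (custom scorer):", neg_mse)
```

Because of the sign flip, the scores reported in cv_results_ with this scorer are always zero or negative, and the value closest to zero wins. Passing scoring='neg_mean_squared_error' to GridSearchCV is a built-in shorthand for the same scorer.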
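2) The amount of data GridSearchCV uses for training. Grid search splits the data into train and test sets according to the cv you supply (K = 5 in your case, so 5-fold cross-validation). This means it trains the SVR on the training split and computes the score on the held-out test split, rather than fitting and scoring on the whole data set as you did, so the answers will differ. With K = 5, only 80% of the data is used for training at a time, which is less data than in your manual check.
This can be addressed by increasing K to 15, 20, or 25.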
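As a rough illustration of point 2, the sketch below counts how many points land in each split. It assumes that GridSearchCV with an integer cv behaves like an unshuffled KFold for a regressor such as SVR; fold_sizes is a helper name I made up for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

# Same inputs as in the question: 100 sorted points on [-10, 10].
X = np.linspace(-10, 10, 100).reshape(-1, 1)

def fold_sizes(n_splits):
    """Return (train_size, test_size) for each fold of a plain, unshuffled KFold."""
    kf = KFold(n_splits=n_splits)
    return [(len(train), len(test)) for train, test in kf.split(X)]

sizes_5 = fold_sizes(5)
sizes_20 = fold_sizes(20)
print("K=5 :", sizes_5[0])   # (80, 20)
print("K=20:", sizes_20[0])  # (95, 5)

# Without shuffling, each test fold is a contiguous block of the sorted inputs.
first_test = next(KFold(n_splits=5).split(X))[1]
print("First test fold covers x in [%.1f, %.1f]"
      % (X[first_test].min(), X[first_test].max()))
```

With 100 points, K = 5 trains on 80 points per fold and K = 20 trains on 95, so a larger K brings each fit closer to the full-data fit. Since the inputs here are sorted, each unshuffled test fold is also a contiguous range of x, which may affect the scores as well.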
After making these two changes, this is what I get: