I'm using Ridge linear regression from scikit-learn. In the documentation they state that the alpha parameter has to be small. However, I get my best model performance at alpha = 6060. Am I doing something wrong?
Here is the note from the documentation:
alpha : {float, array-like} shape = [n_targets] Small positive values
of alpha improve the conditioning of the problem and reduce the
variance of the estimates.
Here is my code:
import pandas as pd
import numpy as np
import custom_metrics as cmetric
from sklearn import preprocessing
from sklearn import cross_validation
from sklearn import linear_model
# Read data files:
df_train = pd.read_csv(path + "/input/train.csv")
df_test = pd.read_csv(path + "/input/test.csv")
#print df.shape
#(50999, 34)
#convert categorical features into integers
feature_cols_obj = [col for col in df_train.columns if df_train[col].dtypes == 'object']
le = preprocessing.LabelEncoder()
for col in feature_cols_obj:
    df_train[col] = le.fit_transform(df_train[col])
    df_test[col] = le.transform(df_test[col])
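# Note: each le.transform(df_test[col]) reuses the encoder fit on the matching
# train column just above; it raises a ValueError if the test column contains
# a category never seen in the train column.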
#Scale the data so that each feature has zero mean and unit std
feature_cols = [col for col in df_train.columns if col not in ['Hazard','Id']]
scaler = preprocessing.StandardScaler().fit(df_train[feature_cols])
df_train[feature_cols] = scaler.transform(df_train[feature_cols])
df_test[feature_cols] = scaler.transform(df_test[feature_cols])
#polynomial features/interactions
X_train = df_train[feature_cols]
X_test = df_test[feature_cols]
y = df_train['Hazard']
test_ids = df_test['Id']
poly = preprocessing.PolynomialFeatures(2)
X_train = poly.fit_transform(X_train)
X_test = poly.transform(X_test)
#do grid search to find best value for alpha
#alphas = np.arange(-10,3,1)
#clf = linear_model.RidgeCV(10**alphas)
alphas = np.arange(100,10000,10)
clf = linear_model.RidgeCV(alphas)
clf.fit(X_train, y)
print clf.alpha_
#clf.alpha=6060
cv = cross_validation.KFold(df_train.shape[0], n_folds=10)
mse = []
mse_train = []
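# (named mse for convenience, but both lists actually collect normalized Gini scores)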
fold_count = 0
for train, test in cv:
    print("Processing fold %s" % fold_count)
    train_fold = df_train.iloc[train]
    test_fold = df_train.iloc[test]
    # Get training examples
    X_train = train_fold[feature_cols]
    y = train_fold['Hazard']
    X_test = test_fold[feature_cols]
    # interactions
    poly = preprocessing.PolynomialFeatures(2)
    X_train = poly.fit_transform(X_train)
    X_test = poly.transform(X_test)
    # Fit Ridge linear regression
    cfr = linear_model.Ridge(alpha=6060)
    cfr.fit(X_train, y)
    # Check error on test set
    pred = cfr.predict(X_test)
    mse.append(cmetric.normalized_gini(test_fold.Hazard, pred))
    # Check error on training set (resubstitution error)
    mse_train.append(cmetric.normalized_gini(y, cfr.predict(X_train)))
    # Done with the fold
    fold_count += 1
#print model coeff
print cfr.coef_
print pd.DataFrame(mse).mean()
#0.311794
print pd.DataFrame(mse_train).mean()
#.344775
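For reference, the commented-out log-scale search above would expand to roughly this sketch (same X_train and y; note the float base, since numpy will not raise an integer array to negative integer powers):
alphas = 10.0 ** np.arange(-10, 3)
clf = linear_model.RidgeCV(alphas=alphas)
clf.fit(X_train, y)
print(clf.alpha_)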
Here is a statistical description of the data. Before the polynomial features:
T1_V1 T1_V2 T1_V3 T1_V4 T1_V5 \
count 45899.000000 45899.000000 45899.000000 45899.000000 45899.000000
mean -0.000731 -0.001736 0.000183 -0.001917 0.000392
std 1.000116 0.999538 1.000170 1.000554 0.999491
min -1.687746 -1.893892 -1.256792 -1.394844 -1.330461
25% -0.720234 -0.934764 -0.681865 -0.978753 -1.008006
50% -0.139727 0.184219 -0.106938 0.685608 0.281812
75% 0.827786 0.823638 0.467988 0.685608 1.249175
max 1.795298 1.782766 3.342622 1.517788 1.571630
T1_V6 T1_V7 T1_V8 T1_V9 T1_V10 \
count 45899.000000 45899.000000 45899.000000 45899.000000 45899.000000
mean 0.000085 0.000574 -0.000776 0.001024 -0.000792
std 1.000021 1.001709 0.999421 0.999460 0.999491
min -0.886738 -2.559151 -2.426625 -2.894427 -1.396415
25% -0.886738 -0.188322 -0.199566 -0.499280 -1.118270
50% -0.886738 -0.188322 -0.199566 -0.499280 0.272457
75% 1.127729 -0.188322 -0.199566 0.698293 0.272457
max 1.127729 4.553336 4.254553 3.093439 1.385038
... T2_V6 T2_V7 T2_V8 T2_V9 \
count ... 45899.000000 45899.000000 45899.000000 45899.000000
mean ... -0.000248 -0.002250 0.002158 -0.002376
std ... 1.000600 1.000546 1.009264 1.000567
min ... -1.185107 -1.969111 -0.164560 -1.571220
25% ... 0.064723 -0.426425 -0.164560 -0.887667
50% ... 0.064723 0.087804 -0.164560 0.206019
75% ... 0.064723 1.116261 -0.164560 0.752862
max ... 6.313873 1.116261 10.045186 1.709837
T2_V10 T2_V11 T2_V12 T2_V13 T2_V14 \
count 45899.000000 45899.000000 45899.000000 45899.000000 45899.000000
mean -0.000526 -0.003068 0.000881 -0.003165 -0.000713
std 0.999744 1.001545 1.000736 1.001126 0.999412
min -1.843477 -1.620956 -0.472133 -1.756894 -1.151631
25% -0.789013 -1.620956 -0.472133 -0.488816 -0.358019
50% -0.261781 0.616920 -0.472133 0.779261 -0.358019
75% 0.792683 0.616920 -0.472133 0.779261 0.435593
max 1.319915 0.616920 2.118047 0.779261 3.610041
T2_V15
count 45899.000000
mean -0.001722
std 0.998565
min -0.807511
25% -0.807511
50% -0.482489
75% 0.492577
max 2.767731
[8 rows x 32 columns]
After the polynomial features:
0 1 2 3 4 \
count 45899 45899.000000 45899.000000 45899.000000 45899.000000
mean 1 -0.000731 -0.001736 0.000183 -0.001917
std 0 1.000116 0.999538 1.000170 1.000554
min 1 -1.687746 -1.893892 -1.256792 -1.394844
25% 1 -0.720234 -0.934764 -0.681865 -0.978753
50% 1 -0.139727 0.184219 -0.106938 0.685608
75% 1 0.827786 0.823638 0.467988 0.685608
max 1 1.795298 1.782766 3.342622 1.517788
5 6 7 8 9 \
count 45899.000000 45899.000000 45899.000000 45899.000000 45899.000000
mean 0.000392 0.000085 0.000574 -0.000776 0.001024
std 0.999491 1.000021 1.001709 0.999421 0.999460
min -1.330461 -0.886738 -2.559151 -2.426625 -2.894427
25% -1.008006 -0.886738 -0.188322 -0.199566 -0.499280
50% 0.281812 -0.886738 -0.188322 -0.199566 -0.499280
75% 1.249175 1.127729 -0.188322 -0.199566 0.698293
max 1.571630 1.127729 4.553336 4.254553 3.093439
... 551 552 553 554 \
count ... 45899.000000 45899.000000 45899.000000 45899.000000
mean ... 1.001451 0.231269 0.019758 -0.015785
std ... 1.647125 0.796845 1.026707 0.910075
min ... 0.222910 -3.721184 -2.439209 -1.710345
25% ... 0.222910 -0.367915 -0.580348 -0.386016
50% ... 0.222910 -0.068564 0.169033 0.227799
75% ... 0.222910 0.829488 0.169033 0.381252
max ... 4.486123 1.650512 7.646235 5.862185
555 556 557 558 559 \
count 45899.000000 45899.000000 45899.000000 45899.000000 45899.000000
mean 1.002242 -0.072864 0.006086 0.998802 -0.013314
std 1.070157 1.007916 0.953547 1.768235 0.949678
min 0.021090 -6.342458 -4.862610 0.128178 -3.187406
25% 0.607248 -0.278991 -0.629262 0.128178 -0.351746
50% 0.607248 -0.278991 -0.117269 0.189741 0.072986
75% 0.607248 0.339440 0.394724 1.326255 0.289104
max 3.086676 2.813165 2.156786 13.032392 9.991622
560
count 45899.000000
mean 0.997114
std 1.573975
min 0.024796
25% 0.232795
50% 0.652073
75% 0.652073
max 7.660336
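That is 561 columns (0 through 560), which is exactly what PolynomialFeatures(2) should produce from 32 inputs: 1 bias term + 32 linear + 32 squared + C(32,2) = 496 interaction terms. A quick sanity-check sketch:
poly = preprocessing.PolynomialFeatures(2)
print(poly.fit_transform(np.zeros((1, 32))).shape)
#(1, 561)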
Here are the cv_values_ for alpha:
clf = linear_model.RidgeCV(store_cv_values=True)
clf.fit(X_train, y)
print clf.cv_values_
[[ 2.66305438e+00 2.66309171e+00 2.66347365e+00]
[ 1.54423791e+00 1.54415884e+00 1.54339859e+00]
[ 6.67823810e+00 6.67822709e+00 6.67821319e+00]
...,
[ 1.30064559e-02 1.30216638e-02 1.31734569e-02]
[ 2.75705381e+01 2.75705980e+01 2.75713343e+01]
[ 9.88136940e+00 9.88182038e+00 9.88626893e+00]]
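Each row above is one sample's leave-one-out error, and the three columns line up with RidgeCV's default alpha grid of (0.1, 1.0, 10.0). A sketch of reducing them to one mean error per alpha:
default_alphas = np.array([0.1, 1.0, 10.0])
mean_errors = clf.cv_values_.mean(axis=0)
print(dict(zip(default_alphas, mean_errors)))
print(default_alphas[np.argmin(mean_errors)])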
Answer (score: 1):
This could be a sign of overfitting; you may want to reduce your feature set.
When you fit a regressor to the training set, some of the features end up fitting random variation in the data. When you then test out of sample (e.g. via k-fold validation), the fit quality is poor, because those extra features were fit to noise rather than to the central tendency. Higher alpha values help drive those coefficients toward zero, which reduces the degree of overfitting.
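To see that shrinkage concretely, here is a minimal sketch (reusing the polynomial-expanded X_train and y from the question) that prints the L2 norm of the fitted coefficients as alpha grows:
for a in [0.1, 1.0, 100.0, 6060.0, 1e6]:
    ridge = linear_model.Ridge(alpha=a)
    ridge.fit(X_train, y)
    print("alpha=%g  coefficient norm=%.4f" % (a, np.linalg.norm(ridge.coef_)))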
You may want to prune your feature set (eliminate some columns from the input data), perhaps starting with the terms the ridge regressor weights most heavily. Another option is the lasso regressor, which drives small coefficients all the way to zero; a sketch follows below. The lasso is not a perfect solution either, though, since it is also prone to overfitting.
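A minimal sketch using scikit-learn's LassoCV (assuming the same X_train and y as in the question; LassoCV chooses its own alpha by cross-validation):
lasso = linear_model.LassoCV(cv=10)
lasso.fit(X_train, y)
print(lasso.alpha_)
print("nonzero coefficients: %d of %d" % ((lasso.coef_ != 0).sum(), lasso.coef_.size))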