High alpha parameter in Ridge regression

Time: 2015-08-13 05:13:32

Tags: python pandas scikit-learn

I am using Ridge linear regression from scikit-learn. In the documentation they state that the alpha parameter must be small.

However, I get my best model performance at alpha = 6060. Am I doing something wrong?

Here is the note from the documentation:

alpha : {float, array-like} shape = [n_targets] Small positive values
of alpha improve the conditioning of the problem and reduce the
variance of the estimates.

Here is my code:

import pandas as pd
import numpy as np
import custom_metrics as cmetric  # user-defined module providing normalized_gini
from sklearn import preprocessing
from sklearn import cross_validation
from sklearn import linear_model

# Read data files:
df_train = pd.read_csv(path + "/input/train.csv")
df_test  = pd.read_csv(path + "/input/test.csv")

#print df.shape
#(50999, 34)

# Convert categorical features into integers
feature_cols_obj = [col for col in df_train.columns if df_train[col].dtypes == 'object']
le = preprocessing.LabelEncoder()
for col in feature_cols_obj:
    df_train[col] = le.fit_transform(df_train[col])
    df_test[col] = le.transform(df_test[col])  # assumes the test set has no unseen categories

#Scale the data so that each feature has zero mean and unit std
feature_cols = [col for col in df_train.columns if col not in ['Hazard','Id']]
scaler = preprocessing.StandardScaler().fit(df_train[feature_cols])
df_train[feature_cols] = scaler.transform(df_train[feature_cols])                               
df_test[feature_cols] = scaler.transform(df_test[feature_cols]) 

#polynomial features/interactions
X_train = df_train[feature_cols]
X_test = df_test[feature_cols]
y = df_train['Hazard']
test_ids = df_test['Id']
poly = preprocessing.PolynomialFeatures(2)
X_train = poly.fit_transform(X_train)
X_test = poly.transform(X_test)  # transform only; the fit was done on the training data

# Do a grid search to find the best value for alpha
#alphas = np.arange(-10, 3, 1)
#clf = linear_model.RidgeCV(10**alphas)
alphas = np.arange(100, 10000, 10)
clf = linear_model.RidgeCV(alphas)
clf.fit(X_train, y)
print clf.alpha_
# best alpha found: 6060

cv = cross_validation.KFold(df_train.shape[0], n_folds=10)
mse = []        # per-fold normalized Gini on the held-out fold (despite the name)
mse_train = []  # per-fold normalized Gini on the training fold
fold_count = 0
for train, test in cv:
    print("Processing fold %s" % fold_count)
    train_fold = df_train.ix[train]
    test_fold = df_train.ix[test]

    # Get training examples
    X_train = train_fold[feature_cols]
    y = train_fold['Hazard']
    X_test = test_fold[feature_cols]
    #interactions
    poly = preprocessing.PolynomialFeatures(2)
    X_train = poly.fit_transform(X_train)
    X_test = poly.transform(X_test)  # transform only; the fit was done on the training fold

    # Fit Ridge linear regression 
    cfr = linear_model.Ridge(alpha=6060)
    cfr.fit(X_train, y)

    # Check error on test set
    pred = cfr.predict(X_test)

    mse.append(cmetric.normalized_gini(test_fold.Hazard, pred))

    # Check error on training set (Resubsitution error)
    mse_train.append(cmetric.normalized_gini(y, cfr.predict(X_train)))    

    # Done with the fold
    fold_count += 1

# Print the coefficients of the model fitted on the last fold
print cfr.coef_

print pd.DataFrame(mse).mean()
#0.311794
print pd.DataFrame(mse_train).mean()
#0.344775

Here is a statistical description of the data. Before the polynomial features:

              T1_V1         T1_V2         T1_V3         T1_V4         T1_V5  \
count  45899.000000  45899.000000  45899.000000  45899.000000  45899.000000   
mean      -0.000731     -0.001736      0.000183     -0.001917      0.000392   
std        1.000116      0.999538      1.000170      1.000554      0.999491   
min       -1.687746     -1.893892     -1.256792     -1.394844     -1.330461   
25%       -0.720234     -0.934764     -0.681865     -0.978753     -1.008006   
50%       -0.139727      0.184219     -0.106938      0.685608      0.281812   
75%        0.827786      0.823638      0.467988      0.685608      1.249175   
max        1.795298      1.782766      3.342622      1.517788      1.571630   

              T1_V6         T1_V7         T1_V8         T1_V9        T1_V10  \
count  45899.000000  45899.000000  45899.000000  45899.000000  45899.000000   
mean       0.000085      0.000574     -0.000776      0.001024     -0.000792   
std        1.000021      1.001709      0.999421      0.999460      0.999491   
min       -0.886738     -2.559151     -2.426625     -2.894427     -1.396415   
25%       -0.886738     -0.188322     -0.199566     -0.499280     -1.118270   
50%       -0.886738     -0.188322     -0.199566     -0.499280      0.272457   
75%        1.127729     -0.188322     -0.199566      0.698293      0.272457   
max        1.127729      4.553336      4.254553      3.093439      1.385038   

           ...              T2_V6         T2_V7         T2_V8         T2_V9  \
count      ...       45899.000000  45899.000000  45899.000000  45899.000000   
mean       ...          -0.000248     -0.002250      0.002158     -0.002376   
std        ...           1.000600      1.000546      1.009264      1.000567   
min        ...          -1.185107     -1.969111     -0.164560     -1.571220   
25%        ...           0.064723     -0.426425     -0.164560     -0.887667   
50%        ...           0.064723      0.087804     -0.164560      0.206019   
75%        ...           0.064723      1.116261     -0.164560      0.752862   
max        ...           6.313873      1.116261     10.045186      1.709837   

             T2_V10        T2_V11        T2_V12        T2_V13        T2_V14  \
count  45899.000000  45899.000000  45899.000000  45899.000000  45899.000000   
mean      -0.000526     -0.003068      0.000881     -0.003165     -0.000713   
std        0.999744      1.001545      1.000736      1.001126      0.999412   
min       -1.843477     -1.620956     -0.472133     -1.756894     -1.151631   
25%       -0.789013     -1.620956     -0.472133     -0.488816     -0.358019   
50%       -0.261781      0.616920     -0.472133      0.779261     -0.358019   
75%        0.792683      0.616920     -0.472133      0.779261      0.435593   
max        1.319915      0.616920      2.118047      0.779261      3.610041   

             T2_V15  
count  45899.000000  
mean      -0.001722  
std        0.998565  
min       -0.807511  
25%       -0.807511  
50%       -0.482489  
75%        0.492577  
max        2.767731  

[8 rows x 32 columns]

After the polynomial features:

         0             1             2             3             4    \
count  45899  45899.000000  45899.000000  45899.000000  45899.000000   
mean       1     -0.000731     -0.001736      0.000183     -0.001917   
std        0      1.000116      0.999538      1.000170      1.000554   
min        1     -1.687746     -1.893892     -1.256792     -1.394844   
25%        1     -0.720234     -0.934764     -0.681865     -0.978753   
50%        1     -0.139727      0.184219     -0.106938      0.685608   
75%        1      0.827786      0.823638      0.467988      0.685608   
max        1      1.795298      1.782766      3.342622      1.517788   

                5             6             7             8             9    \
count  45899.000000  45899.000000  45899.000000  45899.000000  45899.000000   
mean       0.000392      0.000085      0.000574     -0.000776      0.001024   
std        0.999491      1.000021      1.001709      0.999421      0.999460   
min       -1.330461     -0.886738     -2.559151     -2.426625     -2.894427   
25%       -1.008006     -0.886738     -0.188322     -0.199566     -0.499280   
50%        0.281812     -0.886738     -0.188322     -0.199566     -0.499280   
75%        1.249175      1.127729     -0.188322     -0.199566      0.698293   
max        1.571630      1.127729      4.553336      4.254553      3.093439   

           ...                551           552           553           554  \
count      ...       45899.000000  45899.000000  45899.000000  45899.000000   
mean       ...           1.001451      0.231269      0.019758     -0.015785   
std        ...           1.647125      0.796845      1.026707      0.910075   
min        ...           0.222910     -3.721184     -2.439209     -1.710345   
25%        ...           0.222910     -0.367915     -0.580348     -0.386016   
50%        ...           0.222910     -0.068564      0.169033      0.227799   
75%        ...           0.222910      0.829488      0.169033      0.381252   
max        ...           4.486123      1.650512      7.646235      5.862185   

                555           556           557           558           559  \
count  45899.000000  45899.000000  45899.000000  45899.000000  45899.000000   
mean       1.002242     -0.072864      0.006086      0.998802     -0.013314   
std        1.070157      1.007916      0.953547      1.768235      0.949678   
min        0.021090     -6.342458     -4.862610      0.128178     -3.187406   
25%        0.607248     -0.278991     -0.629262      0.128178     -0.351746   
50%        0.607248     -0.278991     -0.117269      0.189741      0.072986   
75%        0.607248      0.339440      0.394724      1.326255      0.289104   
max        3.086676      2.813165      2.156786     13.032392      9.991622   

                560  
count  45899.000000  
mean       0.997114  
std        1.573975  
min        0.024796  
25%        0.232795  
50%        0.652073  
75%        0.652073  
max        7.660336  

Here are the cv_values_ for alpha (one column per alpha; with no alphas passed, RidgeCV uses its default grid of 0.1, 1.0, 10.0):

clf = linear_model.RidgeCV(store_cv_values=True)
clf.fit(X_train, y)
print clf.cv_values_  
[[  2.66305438e+00   2.66309171e+00   2.66347365e+00]
 [  1.54423791e+00   1.54415884e+00   1.54339859e+00]
 [  6.67823810e+00   6.67822709e+00   6.67821319e+00]
 ..., 
 [  1.30064559e-02   1.30216638e-02   1.31734569e-02]
 [  2.75705381e+01   2.75705980e+01   2.75713343e+01]
 [  9.88136940e+00   9.88182038e+00   9.88626893e+00]]
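
For reference, a minimal sketch (reusing the clf fit just above) of reading the winning alpha off cv_values_ by averaging the per-sample errors in each column:

import numpy as np

mean_cv_error = clf.cv_values_.mean(axis=0)  # one mean error per candidate alpha
best_idx = np.argmin(mean_cv_error)
print np.asarray(clf.alphas)[best_idx]       # alpha with the lowest mean error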

1 Answer:

Answer 0 (score: 1):

This could be a sign of overfitting; you may want to reduce your feature set.

When you fit a regressor to the training set, some of the features end up being used to fit random variation in the data. When you then test out of sample (e.g., via k-fold cross-validation), the fit quality is poor because those extra features were fit to noise rather than to the central tendency. Higher alpha values help drive such coefficients toward zero, which reduces the degree of overfitting.
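
As a minimal sketch of that mechanism (the alpha values below are illustrative, and X_train and y are reused from the question), the total coefficient magnitude shrinks as alpha grows:

from sklearn import linear_model
import numpy as np

for a in [1, 100, 6060]:
    ridge = linear_model.Ridge(alpha=a)
    ridge.fit(X_train, y)
    # the summed absolute coefficient size falls as the penalty grows
    print a, np.abs(ridge.coef_).sum()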

You may want to prune your feature set (eliminate some columns from your input data), perhaps starting with only the terms that the ridge algorithm weights heavily. Another option is to use the lasso regressor, which drives small coefficients all the way to zero; a sketch follows below. Lasso is not a perfect solution either, though, as it is also prone to overfitting.
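
A minimal sketch of the lasso route (the alpha grid below is an illustrative assumption, not tuned for this data): LassoCV picks its own alpha, and the coefficients it leaves at exactly zero mark the columns that could be dropped.

from sklearn import linear_model
import numpy as np

lasso = linear_model.LassoCV(alphas=np.logspace(-4, 1, 30))
lasso.fit(X_train, y)
kept = np.flatnonzero(lasso.coef_)   # indices of features lasso did not zero out
print "%d of %d features kept" % (kept.size, lasso.coef_.size)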