Difference in cross-validation results between R and Python

Asked: 2018-05-29 19:38:39

Tags: r python-3.x scikit-learn linear-regression

I have a data frame like this:

    log.comb  CDEM_TWI  Gruber_Ruggedness      dNBR  TC_Change_Sexton_Rel  \
0   8.714914  10.70240           0.626106  0.701591             -27.12220   
1   6.501334  10.65650           1.146360  0.693891             -35.52890   
2   8.946111  13.58910           1.146360  0.513136               7.00000   
3   8.955151   9.85036           1.126980  0.673891              13.81380   
4   7.751379   7.28264           0.000000  0.256136              10.06940   
5   8.895197   8.36555           0.000000  0.506000             -27.61340   
6   8.676571  12.92650           0.000000  0.600627             -44.48400   
7   8.562267  12.76980           0.519255  0.747009             -29.84790   
8   9.052766  11.81580           0.519255  0.808336             -29.00900   
9   9.133744   9.42046           0.484616  0.604891             -18.53550   
10  8.221441   9.53682           0.484616  0.817336             -21.39920   
11  8.398913  12.32050           0.519255  0.814745             -18.12080   
12  7.587468  11.08880           1.274430  0.590282              92.85710   
13  7.983136   8.95073           1.274430  0.316000             -10.34480   
14  9.044404  11.18440           0.698818  0.608600             -14.77000   
15  8.370293  11.96980           0.687634  0.323000              -9.60452   
16  7.938134  12.42380           0.709549  0.374027              36.53140   
17  8.183456  12.73490           1.439180  0.679627             -12.94420   
18  8.322246   9.61600           0.551689  0.642900              37.50000   
19  7.934997   7.77564           0.519255  0.690936             -25.29880   
20  9.049387  11.16000           0.519255  0.789064             -35.73880   
21  8.071323   6.17036           0.432980  0.574355             -22.43590   
22  6.418345   5.98927           0.432980  0.584991               4.34783   
23  7.950516   5.49527           0.422882  0.689009              25.22520   
24  6.355529   7.35982           0.432980  0.419045             -18.81920   
25  8.043683   5.18300           0.763596  0.582555              50.56180   
26  6.013468   5.34018           0.493781  0.241155              -3.01205   
27  7.961675   5.43264           0.493781  0.421527             -21.72290   
28  8.074614  11.94630           0.493781  0.451800              11.61620   
29  8.370570   6.34100           0.492384  0.550127             -12.50000   

    Pct_Pima  Sand._15cm  
0   75.62120     44.6667  
1   69.30690     41.8333  
2   59.47490     41.8333  
3   66.08800     41.5000  
4   34.31250     39.6667  
5   35.04750     39.2424  
6   62.32120     41.6667  
7   57.14320     43.3333  
8   57.35020     43.3333  
9   72.90980     41.0000  
10  57.61790     38.8333  
11  57.35020     39.8333  
12  69.30690     47.8333  
13  69.30690     47.3333  
14  76.58910     42.8333  
15  75.62120     45.3333  
16  76.69440     41.7727  
17  59.47090     37.8333  
18  61.10130     42.8333  
19  72.67650     38.1818  
20  57.35020     40.6667  
21  23.15380     48.0000  
22  17.15050     51.5000  
23   0.00000     47.5000  
24   6.67001     58.0000  
25  15.18050     54.8333  
26   5.89344     49.0000  
27   5.89344     49.1667  
28  13.18900     48.5000  
29  13.30450     49.0000 

I want to run a linear model using repeated 10-fold cross-validation (100 folds in total).

In Python I do this:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold
from sklearn.metrics import r2_score

X = df[['CDEM_TWI', 'Gruber_Ruggedness', 'dNBR', 'TC_Change_Sexton_Rel', 'Pct_Pima', 'Sand._15cm']].copy()
y = df[['log.comb']].copy()

all_r2 = []
rskf = RepeatedKFold(n_splits=10, n_repeats=10, random_state=42)
for train_index, test_index in rskf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    lm = LinearRegression(fit_intercept = True)
    lm.fit(X_train, y_train)
    pred = lm.predict(X_test)
    r2 = r2_score(y_test, pred)
    all_r2.append(r2)

avg = np.mean(all_r2)

Here avg comes out to -0.11.
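One thing worth noting about averaging per-fold scores here: with only 30 rows, each of the 10 test folds contains just 3 observations, and `r2_score` on 3 points can be strongly negative even when every prediction is close in absolute terms, because the denominator (the fold's total sum of squares) is tiny. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

# A hypothetical 3-point test fold: the true values barely vary,
# so SS_tot (the denominator of R^2) is very small.
y_true = np.array([1.0, 1.1, 1.2])
y_pred = np.array([1.2, 1.1, 1.0])  # each prediction off by at most 0.1

# SS_res = 0.08, SS_tot = 0.02  ->  R^2 = 1 - 0.08/0.02 = -3
fold_r2 = r2_score(y_true, y_pred)
print(fold_r2)  # roughly -3.0
```

A few folds like this can easily drag the mean of 100 fold scores below zero.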

In R I do this:

library(caret)
library(klaR)

train_control <- trainControl(method="repeatedcv", number=10, repeats=10)
model <- train(log.comb~., data=df, trControl=train_control, method="lm")

model returns:

RMSE       Rsquared   MAE      
0.7868838  0.6132806  0.7047198

I'm curious why these results disagree so strongly with each other. I realize the folds differ between the two languages, but since I repeat the procedure so many times, I don't understand why the numbers aren't more similar.
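Part of the gap may also come from the metrics themselves (this is an assumption about caret's internals worth verifying): caret's `Rsquared` is, by default, the squared Pearson correlation between observed and predicted values, while sklearn's `r2_score` is the coefficient of determination. The two can diverge arbitrarily; a sketch with biased but perfectly correlated predictions:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = y_true + 2.0  # perfectly correlated, but biased upward by 2

# Squared correlation (what caret's Rsquared reports, by default)
corr_sq = np.corrcoef(y_true, y_pred)[0, 1] ** 2  # about 1.0

# Coefficient of determination (what sklearn's r2_score reports)
# SS_res = 5 * 4 = 20, SS_tot = 10  ->  R^2 = 1 - 20/10 = -1
cod = r2_score(y_true, y_pred)

print(corr_sq, cod)
```

So even on identical folds, the two languages are not reporting the same quantity.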

I also tried a nested grid search in sklearn, like this:

from sklearn.model_selection import KFold, GridSearchCV, cross_val_score

inner_cv = KFold(n_splits=10, shuffle=True, random_state=10)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=10)

param_grid = {'fit_intercept': [True, False],
              'normalize': [True, False]}

# Non-nested parameter search and scoring
clf = GridSearchCV(estimator=LinearRegression(), param_grid=param_grid, cv=inner_cv)
clf.fit(X, y)
non_nested_score = clf.best_score_

# Pass the GridSearchCV estimator to cross_val_score
clf = GridSearchCV(estimator=LinearRegression(), param_grid=param_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X, y=y, cv=outer_cv).mean()

Both nested_score and non_nested_score are negative.
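On a dataset this small, a mean of ten 3-point fold scores is very noisy. A more stable alternative (a sketch, not the approach used above, and shown on a synthetic stand-in since the question's df isn't reproducible here) is to pool the out-of-fold predictions with `cross_val_predict` and compute a single R² over all of them:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.metrics import r2_score

# Hypothetical 30-row, 6-feature dataset standing in for the question's df
X, y = make_regression(n_samples=30, n_features=6, noise=10.0, random_state=0)

cv = KFold(n_splits=10, shuffle=True, random_state=42)
pred = cross_val_predict(LinearRegression(), X, y, cv=cv)

# One R^2 over all pooled out-of-fold predictions,
# instead of a mean of ten 3-point fold scores
pooled_r2 = r2_score(y, pred)
print(pooled_r2)
```

This still measures out-of-sample fit, but avoids dividing by the near-zero variance of a 3-point fold.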

1 answer:

Answer 0 (score: 1)

The Python code returns the average of the results, while the R code returns the best model found.