Ridge regression RMSE on every subset is higher than on the total set

Asked: 2018-11-21 10:37:50

Tags: python scikit-learn regression

I trained a model on one set and am now trying to apply it to each of that set's subsets.

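Mathematically, the total RMSE and MAE (mean absolute error) should lie between the smallest and largest of the individual subset RMSEs and MAEs. However, every subset's RMSE and MAE is higher than the value for the total set.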
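To make that expectation concrete: the total MSE is the size-weighted mean of the subset MSEs, so the total RMSE cannot fall below every subset RMSE. Below is a minimal sketch with synthetic numbers; the data, the group labels and the helper `rmse` are made up for illustration and are not part of my pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic targets, predictions and group labels (all hypothetical).
y = rng.normal(size=300)
y_pred = y + rng.normal(scale=0.5, size=300)
groups = rng.integers(0, 3, size=300)

def rmse(a, b):
    """Root mean squared error between two 1-D arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

total_rmse = rmse(y, y_pred)
group_ids = np.unique(groups)
group_rmse = np.array([rmse(y[groups == g], y_pred[groups == g]) for g in group_ids])
group_size = np.array([(groups == g).sum() for g in group_ids])

# The total MSE equals the size-weighted mean of the per-group MSEs.
weighted_mse = np.sum(group_size * group_rmse ** 2) / group_size.sum()
```

This only holds when the same predictions are used for the total and for the subsets, which is the assumption my results seem to violate.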

Here is what I did:

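%pyspark
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import RobustScaler
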
def preprocessing(features, attributes):

    features_2 = features[attributes]
    y = features['y'].values
    x = features_2.values 

    robustScaler = RobustScaler(quantile_range=(25.0,75.0))
    xScaled = robustScaler.fit_transform(x[:,1:x.shape[1]])

    xScaled[xScaled < -2.0] = -2.0 
    xScaled[xScaled > 2.0] = 2.0
    xCustomers = x[:,0]
    xCustomers_reshaped = xCustomers.reshape((x[:,0].size, 1)) 
    x_TS = xScaled 
    x_T0 = xScaled[:,:] 
    x_T0_all = np.hstack((np.ones((x_T0.shape[0], 1)), x_T0, x_T0**2, x_T0**3)) 
    xCustR = xCustomers.reshape((x[:,0].size, 1)) 
    x_TS_all = np.hstack((xCustR*np.ones((x_TS.shape[0], 1)), xCustR*x_TS, xCustR*(x_TS**2), xCustR*(x_TS**3))) 
    x_all = np.hstack((x_T0_all, x_TS_all))
    variable_names = features_2.columns[1:].tolist() 
    return x_all, variable_names, y

def trainModel(features,attributes,optAlpha):
    x_all, variable_names, y = preprocessing(features, attributes)
    ridge = linear_model.Ridge(fit_intercept=False, copy_X=True, alpha=optAlpha, solver='auto')
    ridge.fit(x_all, y)
    return ridge

def useModel(features,ridge,attributes):
    x_all, variable_names, y = preprocessing(features, attributes)
    y_pred = ridge.predict(x_all)
    rmse = np.sqrt(mean_squared_error(y,y_pred))
    mae = mean_absolute_error(y, y_pred)    
    print("RMSE on test set: ", round(rmse, 2))
    print("MAE on test set:  ", round(mae, 2))
    return y_pred, y, rmse, mae

ridge = trainModel(df_features_train, attributes, optAlpha)
useModel(df_features_train,ridge,attributes)

RMSE on test set:  67.05
MAE on test set:   52.5

Now I apply the useModel function, including the preprocessing, separately to each orgID.

orgIDError = pd.DataFrame([],columns=['orgID','rmse','mae'])

for orgID in df_features['orgID'].unique():
    yPred, y, rmse, mae = useModel(df_features_train[df_features_train.orgID == orgID],ridge,attributes)
    df = pd.DataFrame([[orgID,rmse,mae]],columns=['orgID','rmse','mae'])
    orgIDError = pd.concat([orgIDError, df])
print(orgIDError)

   orgID       rmse          mae
0  615   194.848564   155.502885
0  577   101.156573    76.083797
0  957  1564.256952   814.316566
0  763   832.782755   501.865561
0  616  1337.456555   860.404253
0  968   526.207558   347.265139
0  954  1570.315284  1149.191017
0  874   241.254153   202.429037
0  554   402.013992   344.846957
0  950  1073.348186   673.874603

Any idea what is going wrong?

1 Answer:

Answer 0 (score: 1)

I found it myself.

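The robustScaler in the preprocessing behaves differently on different sets/subsets: preprocessing calls fit_transform, so the scaler is refit on whatever data it receives.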
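A small sketch of the effect, using a made-up feature column rather than the real data: a RobustScaler fitted on a subset learns a different median and interquartile range than one fitted on the full set, so the same rows end up with different scaled values.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Hypothetical feature column: two groups with different location and spread.
full = np.concatenate([rng.normal(0, 1, 200), rng.normal(10, 5, 200)]).reshape(-1, 1)
subset = full[:200]

scaler_full = RobustScaler(quantile_range=(25.0, 75.0)).fit(full)
scaler_subset = RobustScaler(quantile_range=(25.0, 75.0)).fit(subset)

# The same rows, scaled with the two different fits.
scaled_by_full = scaler_full.transform(subset)
scaled_by_subset = scaler_subset.transform(subset)

print("center (full, subset):", scaler_full.center_, scaler_subset.center_)
print("scale  (full, subset):", scaler_full.scale_, scaler_subset.scale_)
```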

As a result, the values in each subset are prepared differently and no longer match what the model was trained on.
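One way to avoid this, sketched under assumptions: fit the scaler once on the training data and only call transform when evaluating a subset. The data below is synthetic, `preprocess` is a simplified stand-in for the preprocessing function in the question (it omits the customer interaction terms), and `alpha=1.0` is an arbitrary placeholder for optAlpha.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
n = 600
org = rng.integers(0, 3, size=n)  # hypothetical orgID labels
x = rng.normal(loc=org[:, None] * 5.0, scale=1.0 + org[:, None], size=(n, 2))
y = x @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)

# Fit the scaler ONCE, on the training data only.
scaler = RobustScaler(quantile_range=(25.0, 75.0)).fit(x)

def preprocess(x_raw, fitted_scaler):
    # Only transform here, so every subset is mapped exactly like the training set.
    x_scaled = np.clip(fitted_scaler.transform(x_raw), -2.0, 2.0)
    ones = np.ones((x_scaled.shape[0], 1))
    return np.hstack((ones, x_scaled, x_scaled ** 2, x_scaled ** 3))

ridge = Ridge(fit_intercept=False, alpha=1.0).fit(preprocess(x, scaler), y)

def rmse(x_raw, y_true):
    y_hat = ridge.predict(preprocess(x_raw, scaler))
    return float(np.sqrt(mean_squared_error(y_true, y_hat)))

total_rmse = rmse(x, y)
org_rmse = {int(g): rmse(x[org == g], y[org == g]) for g in np.unique(org)}
print("total:", total_rmse, "per orgID:", org_rmse)
```

With a shared, already fitted scaler, each row gets the same prediction whether it is scored as part of the full set or of a subset, so the total RMSE lies between the smallest and largest per-orgID RMSE again.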