MAE using Pipeline and GridSearchCV

Time: 2020-03-21 05:36:48

Tags: python pandas machine-learning scikit-learn data-science

I am having trouble obtaining the mean absolute error (MAE) when using Pipeline and GridSearchCV.

Background

I worked on a data science project (MWE shown below) in which the MAE of a classifier is reported as its performance metric.

#Library
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

#Data import and preparation
data = pd.read_csv("data.csv")
data_features = ['location','event_type_count','log_feature_count','total_volume','resource_type_count','severity_type']
X = data[data_features]
y = data.fault_severity

#Train Validation Split for Cross Validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

#RandomForest Modeling
RF_model = RandomForestClassifier(n_estimators=100, random_state=0)
RF_model.fit(X_train, y_train)

#RandomForest Prediction
y_predict = RF_model.predict(X_valid)

#MAE 
print(mean_absolute_error(y_valid, y_predict))
#Output:
#   0.38727149627623564

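Challenge:

Now I am trying to achieve the same thing using Pipeline and GridSearchCV (MWE shown below). I expect it to return the same MAE value as above. Unfortunately, none of the 3 approaches below worked.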

#Library
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

#Data import and preparation
data = pd.read_csv("data.csv")
data_features = ['location','event_type_count','log_feature_count','total_volume','resource_type_count','severity_type']
X = data[data_features]
y = data.fault_severity

#Train Validation Split for Cross Validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

#RandomForest Modeling via Pipeline and Hyper-parameter tuning
steps = [('rf', RandomForestClassifier(random_state=0))]
pipeline = Pipeline(steps) # define the pipeline object.
parameters = {'rf__n_estimators':[100]}
grid = GridSearchCV(pipeline, param_grid=parameters, scoring='neg_mean_squared_error', cv=None, refit=True)
grid.fit(X_train, y_train)

#Approach 1:
print(grid.best_score_)
# Output:
#    -0.508130081300813

#Approach 2:
y_predict=grid.predict(X_valid)
print("score = %3.2f"%(grid.score(y_predict, y_valid)))
# Output:
#    ValueError: Expected 2D array, got 1D array instead:
#    array=[0. 0. 0. ... 0. 1. 0.].
#    Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

#Approach 3:
y_predict_df = pd.DataFrame(y_predict.reshape(len(y_predict), -1),columns=['fault_severity'])
print("score = %3.2f"%(grid.score(y_predict_df, y_valid)))
# Output: 
#    ValueError: Number of features of the model must match the input. Model n_features is 6 and input n_features is 1 

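Discussion:

Approach 1: With the scoring argument of GridSearchCV() set to neg_mean_squared_error, I tried reading grid.best_score_. However, it does not give the same MAE result.

Approach 2: I obtained y_predict from grid.predict(X_valid), then tried grid.score(y_predict, y_valid) to get the MAE, since the scoring argument of GridSearchCV() is set to neg_mean_squared_error. It raises a ValueError complaining "Expected 2D array, got 1D array instead".

Approach 3: I tried reshaping y_predict, but that did not work either. This time it raises "ValueError: Number of features of the model must match the input."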

It would be very helpful if you could point out where I may have gone wrong.

If needed, data.csv is available at https://www.dropbox.com/s/t1h53jg1hy4x33b/data.csv

Thank you very much.

1 Answer:

Answer 0 (score: 1)

You are comparing mean_absolute_error with neg_mean_squared_error, which are very different metrics. You should use neg_mean_absolute_error when creating the GridSearchCV object, as follows:

grid = GridSearchCV(pipeline, param_grid=parameters,scoring='neg_mean_absolute_error', cv=None, refit=True)
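One caveat that may be worth spelling out: even with scoring='neg_mean_absolute_error', grid.best_score_ is the mean cross-validated score over folds of the training data, with the sign flipped, so it is not expected to match an MAE computed on X_valid. The sketch below illustrates this on synthetic data; the make_classification dataset, the 20 trees, and the explicit cv=5 are assumptions chosen only to keep the example small and self-contained, not values from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Assumption: synthetic stand-in for data.csv (6 features, 3 integer classes).
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)

pipeline = Pipeline([('rf', RandomForestClassifier(random_state=0))])
parameters = {'rf__n_estimators': [20]}  # assumption: small forest for speed
n_folds = 5
grid = GridSearchCV(pipeline, param_grid=parameters,
                    scoring='neg_mean_absolute_error', cv=n_folds, refit=True)
grid.fit(X_train, y_train)

# best_score_ is the NEGATED MAE averaged over the CV folds of X_train.
cv_mae = -grid.best_score_

# The same number can be rebuilt from the per-fold scores in cv_results_.
fold_scores = [grid.cv_results_[f'split{i}_test_score'][grid.best_index_]
               for i in range(n_folds)]
print("cross-validated MAE:", cv_mae)
print("mean of fold scores (negated):", -np.mean(fold_scores))
```

Negating best_score_ gives a cross-validated MAE estimate, which is useful for comparing hyper-parameter settings but is a different quantity from the hold-out MAE.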

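In addition, the score method in sklearn takes (X, y) as input, where X is the input feature array of shape (n_samples, n_features) and y is the target labels. You need to change grid.score(y_predict, y_valid) to grid.score(X_valid, y_valid).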
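As a rough check of the point about score(X, y), the following sketch compares three ways of getting the hold-out MAE: a standalone RandomForestClassifier, mean_absolute_error applied to grid.predict(X_valid), and the negated grid.score(X_valid, y_valid). The synthetic dataset and the 20 trees are assumptions standing in for data.csv and the question's settings. With refit=True, a single candidate in the grid, and the same random_state, all three should agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Assumption: synthetic stand-in for data.csv (6 features, 3 integer classes).
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)

# Baseline: plain model, as in the first MWE of the question.
rf_model = RandomForestClassifier(n_estimators=20, random_state=0)
rf_model.fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_valid, rf_model.predict(X_valid))

# Same model through Pipeline + GridSearchCV with an MAE-based scorer.
pipeline = Pipeline([('rf', RandomForestClassifier(random_state=0))])
parameters = {'rf__n_estimators': [20]}
grid = GridSearchCV(pipeline, param_grid=parameters,
                    scoring='neg_mean_absolute_error', cv=5, refit=True)
grid.fit(X_train, y_train)

# score() expects features and true labels; it predicts internally and
# returns the negated MAE, so flip the sign to compare.
mae_from_score = -grid.score(X_valid, y_valid)
mae_from_predict = mean_absolute_error(y_valid, grid.predict(X_valid))

print(baseline_mae, mae_from_predict, mae_from_score)
```

Passing y_predict as the first argument fails because score treats its first argument as a feature matrix and tries to run the model on it, which explains both ValueErrors in approaches 2 and 3.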

Hope this helps.