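I am having trouble obtaining the mean absolute error (MAE) using Pipeline and GridSearchCV
Background:
I have worked on a data science project (MWE shown below) in which the MAE of a classifier is returned as its performance metric.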
#Library
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
#Data import and preparation
data = pd.read_csv("data.csv")
data_features = ['location','event_type_count','log_feature_count','total_volume','resource_type_count','severity_type']
X = data[data_features]
y = data.fault_severity
#Train Validation Split for Cross Validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
#RandomForest Modeling
RF_model = RandomForestClassifier(n_estimators=100, random_state=0)
RF_model.fit(X_train, y_train)
#RandomForest Prediction
y_predict = RF_model.predict(X_valid)
#MAE
print(mean_absolute_error(y_valid, y_predict))
#Output:
# 0.38727149627623564
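Challenge:
Now I am trying to implement the same thing using Pipeline and GridSearchCV (MWE shown below). The expectation is that it returns the same MAE value as above. Unfortunately, none of the 3 approaches below gets me there.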
#Library
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
#Data import and preparation
data = pd.read_csv("data.csv")
data_features = ['location','event_type_count','log_feature_count','total_volume','resource_type_count','severity_type']
X = data[data_features]
y = data.fault_severity
#Train Validation Split for Cross Validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
#RandomForest Modeling via Pipeline and Hyper-parameter tuning
steps = [('rf', RandomForestClassifier(random_state=0))]
pipeline = Pipeline(steps) # define the pipeline object.
parameters = {'rf__n_estimators':[100]}
grid = GridSearchCV(pipeline, param_grid=parameters, scoring='neg_mean_squared_error', cv=None, refit=True)
grid.fit(X_train, y_train)
#Approach 1:
print(grid.best_score_)
# Output:
# -0.508130081300813
#Approach 2:
y_predict=grid.predict(X_valid)
print("score = %3.2f"%(grid.score(y_predict, y_valid)))
# Output:
# ValueError: Expected 2D array, got 1D array instead:
# array=[0. 0. 0. ... 0. 1. 0.].
# Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
#Approach 3:
y_predict_df = pd.DataFrame(y_predict.reshape(len(y_predict), -1),columns=['fault_severity'])
print("score = %3.2f"%(grid.score(y_predict_df, y_valid)))
# Output:
# ValueError: Number of features of the model must match the input. Model n_features is 6 and input n_features is 1
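Discussion:
Approach 1:
Since the scoring parameter of GridSearchCV() is set to neg_mean_squared_error, I tried reading grid.best_score_. However, it did not give the same MAE result.
Approach 2:
I obtained y_predict using grid.predict(X_valid), then tried grid.score(y_predict, y_valid) to get the MAE, since the scoring parameter of GridSearchCV() is set to neg_mean_squared_error. It raises a ValueError complaining "Expected 2D array, got 1D array instead".
Approach 3:
I tried reshaping y_predict, but that did not work either. This time it raises "ValueError: Number of features of the model must match the input."
I would appreciate it if you could point out where I went wrong.
If needed, data.csv is available at https://www.dropbox.com/s/t1h53jg1hy4x33b/data.csv
Thank you very much.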
Answer 0: (score: 1)
You are comparing mean_absolute_error with neg_mean_squared_error. These are two different metrics, so the values will not match. You should use neg_mean_absolute_error when creating the GridSearchCV object, as shown below:
grid = GridSearchCV(pipeline, param_grid=parameters,scoring='neg_mean_absolute_error', cv=None, refit=True)
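To see why the two metrics cannot be compared directly, here is a minimal sketch in plain Python. The labels are made up for illustration (assuming severity classes 0, 1 and 2), and the helper names `mae` and `mse` are hypothetical, not sklearn functions:

```python
# Hypothetical labels for illustration only (assumed classes 0, 1, 2).
y_true = [0, 0, 1, 2, 2]
y_pred = [0, 2, 1, 0, 2]

# Per-sample errors: two predictions are off by 2 classes, the rest are correct.
errors = [t - p for t, p in zip(y_true, y_pred)]

# Mean absolute error: average of |error|.
mae = sum(abs(e) for e in errors) / len(errors)

# Mean squared error: average of error**2, so an error of 2 counts as 4.
mse = sum(e ** 2 for e in errors) / len(errors)

print("MAE =", mae)
print("MSE =", mse)
```

On these labels the MSE is twice the MAE, because each error of 2 is squared. The "neg_" scorers in sklearn additionally flip the sign so that larger is better, which is why grid.best_score_ in the question is negative.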
In addition, the score method in sklearn takes (X, y) as input, where X is the input features of shape (n_samples, n_features) and y is the target labels. You need to change grid.score(y_predict, y_valid) to grid.score(X_valid, y_valid).
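A rough sketch of both fixes together follows. It uses a synthetic dataset from make_classification as a stand-in for data.csv (an assumption, so the numbers will differ from the question), and a smaller n_estimators to keep it quick:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for data.csv (assumption): 6 features, 3 classes.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)

pipeline = Pipeline([('rf', RandomForestClassifier(random_state=0))])
parameters = {'rf__n_estimators': [20]}
grid = GridSearchCV(pipeline, param_grid=parameters,
                    scoring='neg_mean_absolute_error', cv=None, refit=True)
grid.fit(X_train, y_train)

# score() takes features and true labels; it returns the NEGATED MAE,
# so flip the sign to compare with mean_absolute_error.
holdout_mae = -grid.score(X_valid, y_valid)
direct_mae = mean_absolute_error(y_valid, grid.predict(X_valid))

# best_score_ is the mean cross-validated score on the training folds,
# so it need not equal the hold-out MAE.
cv_mae = -grid.best_score_

print("hold-out MAE via grid.score:", holdout_mae)
print("hold-out MAE via mean_absolute_error:", direct_mae)
print("cross-validated MAE on training data:", cv_mae)
```

The first two values should agree, since grid.score applies the same scorer to the refitted best estimator. The third is averaged over cross-validation folds of X_train, so approach 1 in the question would most likely still differ from the hold-out MAE even with the correct scoring string.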
Hope this helps.