Can't fix ValueError: Invalid parameter criterion for estimator MultiOutputClassifier with GridSearchCV

Asked: 2019-07-22 05:33:56

Tags: python scikit-learn

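I want to write code in Python with scikit-learn using MultiOutputClassifier. My values are text, so I used CountVectorizer(). I also want to find the best parameters for the model, so I used GridSearchCV and model.best_params_ to get the best parameters for the decision tree inside MultiOutputClassifier.

I get the following error and don't know how to fix it, even after searching everywhere: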

ValueError: Invalid parameter criterion for estimator MultiOutputClassifier(estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
           n_jobs=None). Check the list of available parameters with `estimator.get_params().keys()`.

How can I fix this error? Here is the full code:

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer

from sklearn import tree
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score

df = pd.DataFrame({"first":["yes", "no", "yes", "yes", "no"],
                  "second":["yes", "no", "no", "yes", "yes"],
                  "third":["true","true", "false", "true", "false"]})

#print(df)

features = df.iloc[:,-1]
results = df.iloc[:,:-1]

cv = CountVectorizer()  
features = cv.fit_transform(features)

features_train, features_test, result_train, result_test = train_test_split(features, results, test_size = 0.3, random_state = 42)

tuned_tree = {'criterion':['entropy','gini'], 'random_state':[1,2,3,4,5,6,7,8,9,10,11,12,13]}

cls = GridSearchCV(MultiOutputClassifier(tree.DecisionTreeClassifier()), tuned_tree)
model = cls.fit(features_train, result_train)

acc_prediction  = model.predict(features_test)
accuracy_test = accuracy_score(result_test, acc_prediction)

print(accuracy_test, model.best_params_)

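3 answers:

Answer 0 (score: 0)

MultiOutputClassifier has no criterion parameter of its own. To reach the parameters of the estimator it wraps, prefix them with estimator__ (two underscores).

Try this: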

{'estimator__criterion':['entropy','gini']}
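The error message itself suggests checking `estimator.get_params().keys()`. A minimal sketch of that check, assuming a default DecisionTreeClassifier wrapped in MultiOutputClassifier as in the question (the variable names are illustrative):

```python
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Same wrapping as in the question: a decision tree inside MultiOutputClassifier.
wrapped = MultiOutputClassifier(DecisionTreeClassifier())

# These are the only names GridSearchCV accepts as keys of the parameter grid.
param_names = sorted(wrapped.get_params().keys())
print(param_names)
```

The list contains the wrapper's own parameters (such as `estimator` and `n_jobs`) plus the tree's parameters under the `estimator__` prefix; a bare `criterion` is not among them, which is why the grid search rejects it.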

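Note: you should not tune random_state for any reason. It is only there to give you reproducibility.

You need to binarize the labels (the target variables) to be able to compute metrics in a multilabel setting.

A stratified train-test split is not defined in sklearn for the multilabel format. So you have to do a random train-test split first and then apply the binarization.

sklearn offers many metrics for multilabel problems; see the scikit-learn documentation.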

import pandas as pd  

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer

from sklearn import tree
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn import preprocessing


df = pd.DataFrame({"first":["yes", "no", "yes", "yes", "no"],
                  "second":["yes", "no", "no", "yes", "yes"],
                  "third":["true","true", "false", "true", "false"]})

train, test = train_test_split(
    df, test_size = 0.3, random_state = 42)

# vectorization
cv = CountVectorizer()  
# always fit the vectorizer on the train data alone
# fitting on complete data leads to data leakage

features_train_vect = cv.fit_transform(train.iloc[:,-1])

# label binarization
mlb = preprocessing.MultiLabelBinarizer()
result_train = mlb.fit_transform(train.iloc[:,:-1].values) 

# applying the transform in test data
result_test = mlb.transform(test.iloc[:,:-1].values)
features_test_vect = cv.transform(test.iloc[:,-1])


params_range = {'estimator__criterion':['entropy','gini']}


cls = GridSearchCV(MultiOutputClassifier(tree.DecisionTreeClassifier(random_state=1)),
                   params_range, cv=3)
model = cls.fit(features_train_vect, result_train)

# f1_score expects the true labels first, then the predictions
f1_score(result_test, cls.predict(features_test_vect), average='weighted')
# 0.6666666666666666
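One caveat about the label binarization step: MultiLabelBinarizer treats each row as an unordered set of labels, so it does not remember which column a "yes" or "no" came from. The sketch below illustrates this on a small made-up array and shows one possible alternative, encoding each target column separately with OrdinalEncoder. The array and the choice of OrdinalEncoder are assumptions for illustration, not part of the answer above.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer, OrdinalEncoder

# Hypothetical targets: two columns, one row per sample.
rows = np.array([["yes", "no"],
                 ["no", "yes"],
                 ["yes", "yes"]])

# Row-as-set encoding: one output column per distinct label ("no", "yes").
as_sets = MultiLabelBinarizer().fit_transform(rows)

# Per-column encoding: one output column per original target column.
per_column = OrdinalEncoder().fit_transform(rows).astype(int)

print(as_sets)
print(per_column)
```

With the set-based encoding the first two rows become identical, while the per-column encoding keeps them distinct. Which one is appropriate depends on whether the columns are separate targets or just a bag of tags.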

Answer 1 (score: 0)

You are passing the DecisionTreeClassifier() constructor to MultiOutputClassifier. Try instantiating the decision tree estimator object and passing that to the function:

dtc = tree.DecisionTreeClassifier()
cls = GridSearchCV(MultiOutputClassifier(dtc), tuned_tree)

Answer 2 (score: 0)

The dictionary you pass should look like this:

tuned_tree = {'estimator__criterion':['entropy','gini'], 'estimator__random_state':[1,2,3,4,5,6,7,8,9,10,11,12,13]}

Every parameter of the wrapped decision tree must carry the estimator__ prefix.
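To see the prefixed dictionary at work end to end, here is a minimal sketch built on the question's toy DataFrame. It makes several simplifying assumptions that are not in the original posts: the targets are encoded per column as 0/1, the train/test split is skipped, the vectorizer is fitted on all rows for brevity, and `cv=2` is chosen only because there are five samples.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({"first": ["yes", "no", "yes", "yes", "no"],
                   "second": ["yes", "no", "no", "yes", "yes"],
                   "third": ["true", "true", "false", "true", "false"]})

# Text feature -> sparse token counts (fitted on all rows only to keep the sketch short).
X = CountVectorizer().fit_transform(df["third"])

# Assumed encoding: one 0/1 column per target.
y = (df[["first", "second"]] == "yes").astype(int).to_numpy()

# Keys carry the estimator__ prefix because the tree sits inside the wrapper.
param_grid = {"estimator__criterion": ["entropy", "gini"]}

search = GridSearchCV(
    MultiOutputClassifier(DecisionTreeClassifier(random_state=1)),
    param_grid,
    cv=2,  # tiny toy dataset, so only two folds
)
search.fit(X, y)

print(search.best_params_)
```

With the unprefixed key `criterion` the same `fit` call raises the ValueError from the question; with the prefixed key it runs and `best_params_` reports the chosen value under `estimator__criterion`.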