I want to write Python code for scikit-learn using MultiOutputClassifier. I have text values, so I used CountVectorizer(), and I want to find the best parameters for the model, so I used GridSearchCV and model.best_params_ to get the best parameters for the decision tree and the MultiOutputClassifier.
I get an error, but I don't know how to fix it, and I have searched everywhere:
ValueError: Invalid parameter criterion for estimator MultiOutputClassifier(estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best'),
n_jobs=None). Check the list of available parameters with `estimator.get_params().keys()`.
How can I fix this error? Here is the full code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import tree
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score
df = pd.DataFrame({"first": ["yes", "no", "yes", "yes", "no"],
                   "second": ["yes", "no", "no", "yes", "yes"],
                   "third": ["true", "true", "false", "true", "false"]})
#print(df)
features = df.iloc[:,-1]
results = df.iloc[:,:-1]
cv = CountVectorizer()
features = cv.fit_transform(features)
features_train, features_test, result_train, result_test = train_test_split(features, results, test_size = 0.3, random_state = 42)
tuned_tree = {'criterion':['entropy','gini'], 'random_state':[1,2,3,4,5,6,7,8,9,10,11,12,13]}
cls = GridSearchCV(MultiOutputClassifier(tree.DecisionTreeClassifier()), tuned_tree)
model = cls.fit(features_train, result_train)
acc_prediction = model.predict(features_test)
accuracy_test = accuracy_score(result_test, acc_prediction)
print(accuracy_test, model.best_params_)
Answer 0 (score: 0)
You need to set the parameters of the estimator wrapped by MultiOutputClassifier using the estimator__ prefix. Try this:
{'estimator__criterion':['entropy','gini']}
Note: you should not tune random_state at all; it is only there to give you reproducibility.
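As the error message itself suggests, you can list the exact parameter names GridSearchCV will accept for the wrapper. A minimal sketch of that check (the available names can vary slightly between scikit-learn versions):
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier
# Print every tunable parameter name exposed by the wrapper
moc = MultiOutputClassifier(DecisionTreeClassifier())
print(sorted(moc.get_params().keys()))
# The nested tree parameters appear with the estimator__ prefix,
# e.g. 'estimator__criterion' and 'estimator__max_depth', alongside
# the wrapper's own parameters such as 'n_jobs'.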
You need to binarize your labels (target variables) to compute metrics in the multilabel setting.
A stratified train-test split is not defined in sklearn for the multilabel format, so you have to do a random train-test split and then apply the binarization.
There are plenty of metrics available in sklearn for multilabel problems, check this.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import tree
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn import preprocessing
df = pd.DataFrame({"first": ["yes", "no", "yes", "yes", "no"],
                   "second": ["yes", "no", "no", "yes", "yes"],
                   "third": ["true", "true", "false", "true", "false"]})
train, test = train_test_split(df, test_size=0.3, random_state=42)
# vectorization
cv = CountVectorizer()
# always fit the vectorizer on the train data alone
# fitting on complete data leads to data leakage
features_train_vect = cv.fit_transform(train.iloc[:,-1])
# label binarization
mlb = preprocessing.MultiLabelBinarizer()
result_train = mlb.fit_transform(train.iloc[:,:-1].values)
# applying the transform in test data
result_test = mlb.transform(test.iloc[:,:-1].values)
features_test_vect = cv.transform(test.iloc[:,-1])
params_range = {'estimator__criterion':['entropy','gini']}
cls = GridSearchCV(MultiOutputClassifier(tree.DecisionTreeClassifier(random_state=1)),
                   params_range, cv=3)
model = cls.fit(features_train_vect, result_train)
f1_score(cls.predict(features_test_vect), result_test, average='weighted')
# 0.6666666666666666
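If you want more than a single F1 score, here is a minimal sketch of a few other multilabel metrics; it reuses the fitted cls, features_test_vect and the binarized result_test from the code above:
from sklearn.metrics import hamming_loss, accuracy_score, f1_score
pred = cls.predict(features_test_vect)
print(hamming_loss(result_test, pred))               # fraction of individual labels that are wrong
print(accuracy_score(result_test, pred))             # subset accuracy: every label in a row must match
print(f1_score(result_test, pred, average='micro'))  # micro-averaged F1 over all labels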
Answer 1 (score: 0)
You are passing the DecisionTreeClassifier() constructor to MultiOutputClassifier. Try instantiating the decision tree estimator object first and then passing it in:
dtc = tree.DecisionTreeClassifier()
cls = GridSearchCV(MultiOutputClassifier(dtc), tuned_tree)
Answer 2 (score: 0)
The dictionary you pass should look like this:
tuned_tree = {'estimator__criterion':['entropy','gini'], 'estimator__random_state':[1,2,3,4,5,6,7,8,9,10,11,12,13]}
All of the parameters must use the estimator__ prefix.
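Putting it together, a sketch of how the question's grid search could look once every nested name is prefixed; it assumes the features_train and result_train variables from the question, and the estimator__max_depth values are added purely as an illustration:
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn import tree
tuned_tree = {'estimator__criterion': ['entropy', 'gini'],
              'estimator__max_depth': [None, 2, 3]}  # illustrative extra parameter
# cv is kept small because the toy dataset has only a few rows
cls = GridSearchCV(MultiOutputClassifier(tree.DecisionTreeClassifier(random_state=1)),
                   tuned_tree, cv=2)
model = cls.fit(features_train, result_train)
print(model.best_params_)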