RandomForest score method ValueError

Time: 2016-11-18 05:28:57

Tags: python machine-learning scikit-learn random-forest unsupervised-learning

I am trying to compute the score of a given dataset against some training data. I wrote the following code:

from sklearn.ensemble import RandomForestClassifier
import numpy as np

randomForest = RandomForestClassifier(n_estimators = 200)

li_train1 =  [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]

li_train2 =  [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]

li_text1 = [[10,20,30,40,50,60,70,80,90], [10,20,30,40,50,60,70,80,90]]

li_text2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]

randomForest.fit(li_train1, li_train2)

output =  randomForest.score(li_train1, li_text1)

When I run the program, I get this error:

Traceback (most recent call last):
  File "trial.py", line 16, in <module>
    output =  randomForest.score(li_train1, li_text1)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/base.py", line 349, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 172, in accuracy_score
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 89, in _check_targets
    raise ValueError("{0} is not supported".format(y_type))
ValueError: multiclass-multioutput is not supported

The documentation for the score method says:

score(X, y, sample_weight=None)
X : array-like, shape = (n_samples, n_features)
    Test samples.

y : array-like, shape = (n_samples) or (n_samples, n_outputs)
    True labels for X.

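In my case, both X and y are arrays, specifically 2d arrays.

I have also gone through this question, but I could not understand where I am going wrong.

EDIT

Following the answer and the subsequent comments, I edited the program as follows: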

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np

randomForest = RandomForestClassifier(n_estimators = 200)

mlb = MultiLabelBinarizer()

li_train1 =  [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]

li_train2 =  [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]

li_text1 = [100,200]

li_text2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]

randomForest.fit(li_train1, li_train2)

output =  randomForest.score(li_train1, li_text1)

After this edit, I get the error:

Traceback (most recent call last):
  File "trial.py", line 19, in <module>
    output =  randomForest.score(li_train1, li_text1)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/base.py", line 349, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 172, in accuracy_score
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 82, in _check_targets
    "".format(type_true, type_pred))
ValueError: Can't handle mix of binary and multiclass-multioutput

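1 Answer:

Answer 0 (score: 0)

According to the documentation:

Warning: At present, no metric in sklearn.metrics supports the multioutput-multiclass classification task.

The score method calls sklearn's accuracy metric, which does not support the multiclass, multioutput classification problem you have defined. In your first call, y has shape (2, 9) with nine distinct label values, so it is treated as a multiclass-multioutput target.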
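To see how scikit-learn categorises a label array before it reaches the accuracy metric, one option is `type_of_target` from `sklearn.utils.multiclass`. The sketch below applies it to the two label lists from the question; the variable names `y_2d` and `y_1d` are mine, not from the original post. The first should be reported as `multiclass-multioutput` and the second as `binary`, which lines up with the wording of the two error messages.

```python
from sklearn.utils.multiclass import type_of_target

# Labels from the question's first attempt: two samples, nine outputs each.
y_2d = [[1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9]]

# Labels from the edited attempt: one label per sample, two distinct values.
y_1d = [100, 200]

kind_2d = type_of_target(y_2d)
kind_1d = type_of_target(y_1d)

print(kind_2d)
print(kind_1d)
```

In the edited program the model was still fitted on the 2d labels, so its predictions stay multiclass-multioutput while the labels passed to score are binary, hence the "mix" error.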

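It is not clear from your question whether you really intend to solve a multiclass, multioutput problem. If that is not your intention, you should restructure your input arrays so that y holds one label per sample.

If, on the other hand, you really do want to solve this kind of problem, you will have to define your own scoring function.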
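As a rough sketch of what such a scoring function might look like, the snippet below computes two common choices with plain NumPy: the accuracy of each output column averaged over columns, and the fraction of samples where every output is correct. The function name `multioutput_accuracy` and the example arrays are hypothetical, and this is only one possible definition, not something provided by scikit-learn.

```python
import numpy as np

def multioutput_accuracy(y_true, y_pred):
    """Return (mean per-output accuracy, exact-match ratio) for 2d label arrays."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    correct = y_true == y_pred
    # Accuracy of each output column, then averaged over the columns.
    mean_per_output = correct.mean(axis=0).mean()
    # Fraction of samples whose outputs are all predicted correctly.
    exact_match = correct.all(axis=1).mean()
    return mean_per_output, exact_match

# Made-up labels: two samples, three outputs each, one wrong entry.
y_true = [[1, 2, 3], [4, 5, 6]]
y_pred = [[1, 2, 0], [4, 5, 6]]

mean_acc, exact = multioutput_accuracy(y_true, y_pred)
print(mean_acc, exact)
```

With a fitted multioutput forest, `y_pred` would come from `randomForest.predict(X_test)`; which of the two numbers is the right score depends on what you care about.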

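UPDATE

Since you are not trying to solve a multiclass, multioutput problem, you should restructure your data so that it looks something like this: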

from sklearn.ensemble import RandomForestClassifier

# training data
X =  [
    [1,2,3,4,5,6,7,8,9],
    [1,2,3,4,5,6,7,8,9]
]

y =  [0,1]

# fit the model
randomForest = RandomForestClassifier(n_estimators=200)
randomForest.fit(X, y)

# test data
Xtest =  [
    [1,2,0,4,5,6,0,8,9],
    [1,1,3,1,5,0,7,8,9]
]

ytest =  [0,1]

output =  randomForest.score(Xtest,ytest)
print(output) # 0.5
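
The toy data above has two identical training rows with different labels, so the score it produces says little about the model. For a slightly more realistic run, here is a minimal sketch that assumes the bundled iris dataset and a held-out split; the dataset choice, the 25% test size, and the `random_state` values are my assumptions for illustration, not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# X has shape (n_samples, n_features); y is 1d with one class label per sample.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples so score() is computed on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# score() returns the mean accuracy of clf.predict(X_test) against y_test.
accuracy = clf.score(X_test, y_test)
print(accuracy)
```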