ValueError: continuous-multioutput is not supported

Time: 2015-10-09 06:26:33

Tags: python linear-regression cross-validation

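I want to run several types of regression (Lasso, Ridge, ElasticNet and SVR) on a dataset with about 5,000 rows and 6 features, using GridSearchCV for cross-validation. The code is extensive, but here are the key parts: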

def splitTrainTestAdv(df):

    y = df.iloc[:,-5:]  # last 5 columns
    X = df.iloc[:,:-5]  # Except for last 5 columns


    #Scaling and Sampling

    X = StandardScaler().fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)


    return X_train, X_test, y_train, y_test

def performSVR(x_train, y_train, X_test, parameter):



    C = parameter[0]
    epsilon = parameter[1] 
    kernel = parameter[2]

    model = svm.SVR(C = C, epsilon = epsilon, kernel = kernel)
    model.fit(x_train, y_train)



    return model.predict(X_test)  # predictions for the test set

def performRidge(X_train, y_train, X_test, parameter):

    alpha = parameter[0]

    model = linear_model.Ridge(alpha=alpha, normalize=True) 
    model.fit(X_train, y_train)



    return model.predict(X_test)  # predictions for the test set

MODELS = {
    'lasso': (
        linear_model.Lasso(),
        {'alpha': [0.95]}
    ),
    'ridge': (
        linear_model.Ridge(),
        {'alpha': [0.01]}
    ),
}


def performParameterSelection(model_name, feature, X_test, y_test, X_train, y_train):


    print("# Tuning hyper-parameters for %s" % feature)
    print()

    model, param_grid = MODELS[model_name]
    gs = GridSearchCV(model, param_grid, n_jobs= 1, cv=5, verbose=1, scoring='%s_weighted' % feature)


    gs.fit(X_train, y_train) 


    print("Best parameters set found on development set:")

    print(gs.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    for params, mean_score, scores in gs.grid_scores_:
        print("%0.3f (+/-%0.03f) for %r"
          % (mean_score, scores.std() * 2, params))

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")

    y_true, y_pred = y_test, gs.predict(X_test)
    print(classification_report(y_true, y_pred))

soil = pd.read_csv('C:/training.csv', index_col=0)
soil = getDummiedSoilDepth(soil)
np.random.seed(2015)
soil = shuffleData(soil)
soil = soil.drop('Depth', 1)

X_train, X_test, y_train, y_test = splitTrainTestAdv(soil)


scores = ['precision', 'recall']

for score in scores:




    for model in MODELS.keys():

        print '####################'
        print model, score
        print '####################'
        performParameterSelection(model, score, X_test, y_test, X_train, y_train)

You can assume that all required imports have been done.

I get this error and don't know why:

ValueError                                Traceback (most recent call last)

in <module>()
     18         print model, score
     19         print '####################'
---> 20         performParameterSelection(model, score, X_test, y_test, X_train, y_train)
     21 

<ipython-input-27-304555776e21> in performParameterSelection(model_name,  feature, X_test, y_test, X_train, y_train)
     12     # cv=5 - constant; verbose - keep writing
     13 
---> 14     gs.fit(X_train, y_train) # Will get grid scores with outputs from ALL models described above
     15 
     16         #pprint(sorted(gs.grid_scores_, key=lambda x: -x.mean_validation_score))

C:\Users\Tony\Anaconda\lib\site-packages\sklearn\grid_search.pyc in fit(self, X, y)

C:\Users\Tony\Anaconda\lib\site-packages\sklearn\metrics\classification.pyc in _check_targets(y_true, y_pred)
     90     if (y_type not in ["binary", "multiclass", "multilabel-indicator",
     91                        "multilabel-sequences"]):
---> 92         raise ValueError("{0} is not supported".format(y_type))
     93 
     94     if y_type in ["binary", "multiclass"]:

ValueError: continuous-multioutput is not supported

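I'm still new to Python, and this error puzzles me. It shouldn't be happening because I have 6 features, surely. I tried to follow the standard built-in functions.

Please help.

3 answers:

Answer 0 (score: 1)

Let's start by reproducing the problem.

First, import the required libraries: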

import numpy as np
import pandas as pd 
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split
from sklearn import linear_model
from sklearn.grid_search import GridSearchCV

Then create some data:

df = pd.DataFrame(np.random.rand(5000,11))
y = df.iloc[:,-5:]  # last 5 columns
X = df.iloc[:,:-5]  # Except for last 5 columns
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)

Now we can reproduce the error and also look at options that do not trigger it.

This runs fine:

gs = GridSearchCV(linear_model.Lasso(), {'alpha': [0.95]}, n_jobs= 1, cv=5, verbose=1)
print gs.fit(X_train, y_train) 

This does not:

gs = GridSearchCV(linear_model.Lasso(), {'alpha': [0.95]}, n_jobs= 1, cv=5, verbose=1, scoring='recall')
gs.fit(X_train, y_train) 

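Indeed, the error is exactly the same as above: 'continuous-multioutput is not supported'.

If you think about the recall measure, it applies to binary or categorical data, where we can define things like false positives. In my reproduction of your data, at least, the targets are continuous, and recall is not defined for them. With the default score it works, as shown above.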
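To see how scikit-learn classifies a target, here is a minimal sketch. It assumes a recent scikit-learn, where `type_of_target` is importable from `sklearn.utils.multiclass`. The arrays are synthetic stand-ins, not the asker's data.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.utils.multiclass import type_of_target

rng = np.random.RandomState(0)

y_multi = rng.rand(20, 5)          # five continuous target columns
y_single = rng.rand(20)            # one continuous target column
y_labels = np.array([0, 1] * 10)   # binary class labels

# Ask scikit-learn how it classifies each kind of target.
target_types = {
    "five float columns": type_of_target(y_multi),
    "one float column": type_of_target(y_single),
    "0/1 labels": type_of_target(y_labels),
}
print(target_types)

# Recall needs discrete classes, so continuous targets are rejected.
try:
    recall_score(y_multi, y_multi)
    recall_error = None
except ValueError as exc:
    recall_error = str(exc)
print(recall_error)
```

Only the 0/1 labels are reported as a classification target. The two float arrays are reported as continuous, which is why classification metrics such as recall refuse them.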

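So you probably need to look at your predictions and understand why they are continuous (i.e. use a classifier instead of a regressor), or else use a different score.

As an aside, if you run the regression with only one set (column) of y values, you still get an error. This time it says more simply 'continuous is not supported', i.e. the problem is using recall (or precision) on continuous data, whether or not it is multi-output.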
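As one way to "use a different score", the sketch below runs the grid search with a regression metric and evaluates with regression metrics instead of `classification_report`. It assumes a recent scikit-learn, where `GridSearchCV` and `train_test_split` live in `sklearn.model_selection` (the thread above uses the older `sklearn.grid_search` and `sklearn.cross_validation` modules). The data and the alpha grid are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.RandomState(0)
X = rng.rand(500, 6)
# Synthetic targets: 5 columns that depend linearly on X, plus a little noise.
y = X.dot(rng.rand(6, 5)) + 0.1 * rng.rand(500, 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# "r2" is a regression scorer, so continuous multi-output targets are accepted.
gs = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0]}, cv=5, scoring="r2")
gs.fit(X_train, y_train)

# Evaluate on the held-out set with regression metrics.
y_pred = gs.predict(X_test)
test_r2 = r2_score(y_test, y_pred)
test_mse = mean_squared_error(y_test, y_pred)
print(gs.best_params_, test_r2, test_mse)
```

Other regression scorers, such as `"neg_mean_squared_error"`, can be passed to `scoring` in the same way.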

Answer 1 (score: 0)

The ultimate goal is to evaluate the model's performance, which you can do with the model.evaluate method:

_,accuracy = model.evaluate(our_data_feat, new_label2,verbose=0.0)
print('Accuracy:%.2f'%(accuracy*100))

This will give you the accuracy value.

Answer 2 (score: 0)

Make sure the dependent variable is a single series, and split your data correctly in train_test_split.
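A rough sketch of fitting one target column at a time follows. It is useful for estimators such as SVR that expect a one-dimensional target. As answer 0 notes, a single column on its own does not make a classification scorer valid, so the sketch also uses a regression scorer. The column names, the data and the parameter grid are hypothetical, and a recent scikit-learn is assumed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.rand(200, 6)
# Hypothetical target names; the real dataset's columns are not shown in the thread.
y = pd.DataFrame(rng.rand(200, 5), columns=["t0", "t1", "t2", "t3", "t4"])

best_params = {}
for name in y.columns:
    gs = GridSearchCV(SVR(), {"C": [0.1, 1.0]}, cv=3,
                      scoring="neg_mean_squared_error")
    gs.fit(X, y[name])  # y[name] is a single Series (one target column)
    best_params[name] = gs.best_params_

print(best_params)
```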