How to encode categorical data for the semi-supervised algorithm LabelPropagation

Asked: 2017-06-02 01:02:20

Tags: machine-learning scikit-learn categorical-data

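I am trying to use the anneal.arff dataset with LabelPropagation, the semi-supervised algorithm in Python's scikit-learn. The anneal dataset is categorical, so I preprocessed it so that the output class for each instance looks like [0. 0. 1. 0. 0.]. This is a list of numbers that encodes the output class as one of 5 possible values, with 0 everywhere and 1. in the position of the corresponding class. This is what I expected.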

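For semi-supervised learning, most of the training data must be unlabeled, so I modified the training set so that the unlabeled rows have the output [-1, -1, -1, -1, -1]. I previously tried using just -1, but the code raised the error shown below.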

I train the classifier as follows, where Y_train includes both the labeled and the "unlabeled" data:

lp_model = LabelSpreading(gamma=0.25, max_iter=5)
lp_model.fit(X, Y_train)

After calling the fit method, I get the error shown below:

File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\semi_supervised\label_propagation.py", line 221, in fit
    X, y = check_X_y(X, y)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 526, in check_X_y
    y = column_or_1d(y, warn=True)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 562, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (538, 5)

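This suggests that something is wrong with the shape of my Y_train list, but as far as I can tell this is the correct shape. What am I doing wrong?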

Can LabelPropagation take training data in this form, or does it only accept unlabeled data as the scalar -1?

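---EDIT---
Here is the code that generates the error. Sorry for the confusion over the algorithms: I want to use both LabelSpreading and LabelPropagation, and choosing one or the other does not resolve this error.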

from scipy.io import arff
import pandas as pd
import numpy as np
import math

from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from copy import deepcopy

from sklearn.semi_supervised import LabelPropagation
from sklearn.semi_supervised import LabelSpreading


f = "../../Documents/UCI/anneal.arff"
dataAsRecArray, meta = arff.loadarff(f)
dataset_raw = pd.DataFrame.from_records(dataAsRecArray)
dataset = pd.get_dummies(dataset_raw)
class_names = [col for col in dataset.columns if 'class_' in col]
print (dataset.shape)
number_of_output_columns = len(class_names)
print (number_of_output_columns)


def run(name, model, dataset, percent):
    # Split-out validation dataset
    array = dataset.values
    X = array[:, 0:-number_of_output_columns]
    Y = array[:, -number_of_output_columns:]
    validation_size = 0.40
    seed = 7
    X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
        X, Y, test_size=validation_size, random_state=seed)
    num_samples = len(Y_train)

    num_labeled_points = math.floor(percent*num_samples)

    indices = np.arange(num_samples)
    unlabeled_set = indices[num_labeled_points:]

    # Mark the unlabeled rows; Y_train is still 2-D here, which is what triggers the error
    Y_train[unlabeled_set] = [-1, -1, -1, -1, -1]
    lp_model = LabelSpreading(gamma=0.25, max_iter=5)
    lp_model.fit(X_train, Y_train)
    """
    predicted_labels = lp_model.transduction_[unlabeled_set]

    print(predicted_labels[:10])
    """


if __name__ == "__main__":
    #percentages = [0.1, 0.2, 0.3, 0.4]
    percentages = [0.1]

    models = []

    models.append(('LS', LabelSpreading()))
    #models.append(('CART', DecisionTreeClassifier()))
    #models.append(('NB', GaussianNB()))
    #models.append(('SVM', SVC()))
    # evaluate each model in turn
    results = []
    names = []
    for name, model in models:
        for percent in percentages:
            run(name, model, dataset, percent)
    print ("bye")

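1 Answer:

Answer 0 (score: 0)

Your Y_train has shape (538, 5), but it should be 1d. LabelPropagation does not currently support multilabel or multioutput-multiclass targets. The error message could be more informative, though :-/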
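
As a rough illustration of the point above, here is a minimal sketch of one way to get a 1d target: collapse each one-hot row to its class index with np.argmax, then mark unlabeled samples with the scalar -1. The helper name one_hot_to_labels, the synthetic two-cluster data, and the choice of which points stay labeled are all assumptions made for this example; they are not taken from the question's anneal pipeline. Note that the conversion has to happen before the unlabeled rows are masked, since argmax of a row like [-1, -1, -1, -1, -1] would silently return class 0.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading


def one_hot_to_labels(Y_one_hot):
    """Collapse an (n_samples, n_classes) one-hot array to 1-D class indices."""
    return np.argmax(np.asarray(Y_one_hot), axis=1)


# Hypothetical stand-in for the real data: two well-separated 2-D clusters.
rng = np.random.RandomState(7)
X_train = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                     rng.normal(3.0, 0.3, size=(20, 2))])

# One-hot targets, in the same style as the question's Y_train.
Y_one_hot = np.zeros((40, 2))
Y_one_hot[:20, 0] = 1.0
Y_one_hot[20:, 1] = 1.0

# Step 1: convert to a 1-D integer label vector of shape (40,).
y_train = one_hot_to_labels(Y_one_hot)

# Step 2: keep a few labels from each class and mark the rest as unlabeled (-1).
labeled = np.array([0, 1, 20, 21])
unlabeled_set = np.setdiff1d(np.arange(len(y_train)), labeled)
y_semi = y_train.copy()
y_semi[unlabeled_set] = -1

# Step 3: fit on the 1-D target; max_iter is larger than the question's 5
# to give the propagation more room to converge on this toy data.
lp_model = LabelSpreading(gamma=0.25, max_iter=30)
lp_model.fit(X_train, y_semi)

# Inferred class indices for the samples that were marked unlabeled.
predicted_labels = lp_model.transduction_[unlabeled_set]
print(predicted_labels[:10])
```

In the question's run function, the equivalent change would be to apply the argmax step to Y_train right after the train/test split and then assign -1 (not a list of five -1 values) to Y_train[unlabeled_set]. The predicted indices can be mapped back to class names through the order of the one-hot columns.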