How to get class names when using MultiLabelBinarizer

Time: 2017-08-11 12:52:35

Tags: python scikit-learn

I have a CSV file that looks like this:

target,data
AAA,some text document
AAA;BBB,more text
AAC,more text

Here is the code:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import BernoulliNB
import pandas as pd

pdf = pd.read_csv("Train.csv", sep=',')
pdfT = pd.read_csv("Test.csv", sep=',')

X1 = pdf['data']
Y1 = [[t for t in tar.split(';')] for tar in pdf['target']]
X2 = pdfT['data']
Y2 = [[t for t in tar.split(';')] for tar in pdfT['target']]

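# Vectorize the text data
# (non_negative=True was removed in scikit-learn 0.21; alternate_sign=False replaces it)
hv = HashingVectorizer(stop_words='english', alternate_sign=False)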
X1 = hv.transform(X1)
X2 = hv.transform(X2)

mlb = MultiLabelBinarizer()
mlb.fit(Y1+Y2)
Y1 = mlb.transform(Y1)
# mlb.classes_ looks like ['AAA','AAC','BBB',...]  len(mlb.classes_)==1363

# Y1 looks like [[0,0,0,....0,0,0], ... ] now

# fit
clsf = OneVsRestClassifier(BernoulliNB(alpha=.001))
clsf.fit(X1,Y1)

# predict_proba
proba = clsf.predict_proba(X2)

# want to get class names back
classnames = mlb.inverse_transform(clsf.classes_) # this is where it breaks

for i in range(len(proba)):
    # get classnames,probability dict
    preDict = dict(zip(classnames, proba[i]))
    # sort dict by probability value, print actual and top 5 predict results
    print(Y2[i], dict(sorted(preDict.items(),key=lambda d:d[1],reverse=True)[0:5]))

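The problem is that after clsf.fit(X1,Y1), clsf.classes_ is an int array [0,1,2,3,... 1362].

Why doesn't it look like Y1? How do I get the class names back from clsf.classes_? Is mlb.classes_ == clsf.classes_, and do they have the same order?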

1 Answer:

Answer 0: (score: 1)

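When OneVsRestClassifier is fit on multi-label targets, a LabelBinarizer is called during the fit call, which converts the multi-label targets into a unique label for each class.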
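A minimal sketch of what that looks like in practice. The toy features and labels below are made up for illustration and are not from the question's data. As far as I can tell, when the target is already an indicator matrix (as produced by MultiLabelBinarizer), the internal LabelBinarizer simply numbers the columns, which is why clsf.classes_ is a range of integers:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: 4 samples, 3 binary features
X = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
labels = [["AAA"], ["AAA", "BBB"], ["AAC"], ["AAC", "BBB"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # indicator matrix, one column per label name

clsf = OneVsRestClassifier(BernoulliNB(alpha=0.001))
clsf.fit(X, Y)

print(mlb.classes_)                    # label names, in sorted order
print(clsf.classes_)                   # column indices of Y, not names
print(clsf.label_binarizer_.classes_)  # the same indices, from the internal LabelBinarizer
```

So column i of Y, and of anything the classifier predicts, corresponds to mlb.classes_[i].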

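You can access the label_binarizer_ attribute of the clsf object; it has a classes_ attribute, which contains the class definitions corresponding to the classes clsf was fit on.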
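To get from those integer classes back to the names, one option is to index mlb.classes_ with clsf.classes_ instead of calling inverse_transform. This is a sketch under the assumption that the classifier was fit on the output of mlb.transform, so that the column order matches. The toy data and the top_k value are hypothetical:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data standing in for the vectorized Train/Test files
X_train = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
y_train = [["AAA"], ["AAA", "BBB"], ["AAC"], ["AAC", "BBB"]]
X_test = np.array([[1, 1, 0], [0, 0, 1]])

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)

clsf = OneVsRestClassifier(BernoulliNB(alpha=0.001)).fit(X_train, Y_train)
proba = clsf.predict_proba(X_test)  # shape: (n_samples, n_labels)

# clsf.classes_ holds column indices into Y_train, so use them to index the names
classnames = mlb.classes_[clsf.classes_]

top_k = 2  # the question prints the top 5; 2 is enough for three labels
results = []
for row in proba:
    best = np.argsort(row)[::-1][:top_k]  # indices of the highest probabilities
    results.append({classnames[j]: float(row[j]) for j in best})

print(results)
```

Since clsf.classes_ is just 0..n-1 here, zipping mlb.classes_ directly with each row of proba should give the same pairing.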