I have a CSV file that looks like this:
target,data
AAA,some text document
AAA;BBB,more text
AAC,more text
Here is the code:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import BernoulliNB
import pandas as pd
pdf = pd.read_csv("Train.csv", sep=',')
pdfT = pd.read_csv("Test.csv", sep=',')
X1 = pdf['data']
Y1 = [[t for t in tar.split(';')] for tar in pdf['target']]
X2 = pdfT['data']
Y2 = [[t for t in tar.split(';')] for tar in pdfT['target']]
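# Vectorize the data
# (alternate_sign=False replaces non_negative=True, which newer scikit-learn releases removed)
hv = HashingVectorizer(stop_words='english', alternate_sign=False)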
X1 = hv.transform(X1)
X2 = hv.transform(X2)
mlb = MultiLabelBinarizer()
mlb.fit(Y1+Y2)
Y1 = mlb.transform(Y1)
# mlb.classes_ looks like ['AAA','AAC','BBB',...] len(mlb.classes_)==1363
# Y1 looks like [[0,0,0,....0,0,0], ... ] now
# fit
clsf = OneVsRestClassifier(BernoulliNB(alpha=.001))
clsf.fit(X1,Y1)
# predict_proba
proba = clsf.predict_proba(X2)
# want to get the class names back
classnames = mlb.inverse_transform(clsf.classes_)  # boom, this is where it fails
for i in range(len(proba)):
    # build a {classname: probability} dict for this sample
    preDict = dict(zip(classnames, proba[i]))
    # sort by probability; print the actual labels and the top 5 predictions
    print(Y2[i], dict(sorted(preDict.items(), key=lambda d: d[1], reverse=True)[0:5]))
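The problem is that after clsf.fit(X1, Y1), clsf.classes_ is an int array [0, 1, 2, 3, ..., 1362].
Why isn't it like Y1? How can I get the class names from clsf.classes_? Is mlb.classes_ == clsf.classes_ or not, and do they have the same order?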
Answer 0 (score: 1)
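When OneVsRestClassifier is fit on multiple labels, a LabelBinarizer is called during the fit call, which converts the multilabel targets into a unique label for each class. You can access the label_binarizer_ attribute of the clsf object; it has a classes_ attribute that contains the class definitions clsf was fit on.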
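
A minimal sketch of how the names can be recovered, assuming Y was produced by a MultiLabelBinarizer as in the question. In that case clsf.classes_ holds the column indices of the indicator matrix, so indexing mlb.classes_ with it should give the names in matching order. The four documents and their labels below are made up for illustration, and alternate_sign=False assumes a scikit-learn version that supports it.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data (hypothetical), standing in for Train.csv
docs = ["some text document", "more text", "other words here", "some more words"]
labels = [["AAA"], ["AAA", "BBB"], ["AAC"], ["BBB"]]

hv = HashingVectorizer(stop_words="english", alternate_sign=False)
X = hv.transform(docs)

# Binarize the label lists; mlb.classes_ keeps the sorted label names
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

clsf = OneVsRestClassifier(BernoulliNB(alpha=0.001))
clsf.fit(X, Y)

# clsf.classes_ are column indices into Y, i.e. positions in mlb.classes_
classnames = mlb.classes_[clsf.classes_]

proba = clsf.predict_proba(X)
for row in proba:
    # pair each class name with its probability and keep the two most likely
    top = sorted(zip(classnames, row), key=lambda d: d[1], reverse=True)[:2]
    print(top)
```

Calling mlb.inverse_transform is still useful, but on a binary indicator matrix such as the output of clsf.predict(X), not on clsf.classes_ itself.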