I have a dataset in which each document has a single label, as in the example below.
label text
pay "i will pay now"
finance "are you the finance guy?"
law "lawyers and law"
court "was at the court today"
finance report "bank reported annual share.."
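A text document can carry several labels, so how do I do multi-label classification on this dataset? I have read a lot of the sklearn documentation, but I can't seem to find the right way to do multi-label classification on a single-label dataset. Thanks in advance for your help.
This is what I have so far: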
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn import preprocessing
loc = r'C:\Users\..\Downloads\excel.xlsx'
df = pd.read_excel(loc)
X = np.array(df.docs)
z = np.array(df.title)
y = np.array(df.raw)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=42)
mlb = preprocessing.MultiLabelBinarizer()
Y = mlb.fit_transform(y_train)
Y_test = mlb.fit_transform(y_test)
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
doc_new = np.array(['X has announced that it will sell $587 million'])
print("Accuracy Score: ", accuracy_score(Y_test, predicted))
print(mlb.inverse_transform(classifier.predict(doc_new)))
But I keep getting a dimension error:
.format(len(self.classes_), yt.shape[1]))
ValueError: Expected indicator for 44 classes, but got 46
Answer 0 (score: 0)
I figured out how to solve this. I used pandas GroupBy:
df = pd.DataFrame(df.groupby(["id", "doc"]).label.apply(list)).reset_index()
This groups the texts that have multiple classes together, and it worked.
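To illustrate what the GroupBy step does, here is a minimal sketch on a few rows borrowed from the question's example. The column names `id`, `doc`, and `label` come from the line above; the `id` values and the long format (one row per document/label pair) are assumptions made for the illustration, not the actual spreadsheet.

```python
import pandas as pd

# Hypothetical long-format data: one row per (document, label) pair.
# The last document appears twice because it carries two labels.
df = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "doc": [
        "i will pay now",
        "are you the finance guy?",
        "bank reported annual share..",
        "bank reported annual share..",
    ],
    "label": ["pay", "finance", "finance", "report"],
})

# Collapse the rows so each document has a single list of labels.
grouped = df.groupby(["id", "doc"])["label"].apply(list).reset_index()
print(grouped)
```

Each row of `grouped` now holds a list of labels, which is the input shape `MultiLabelBinarizer` expects. A plain string would be split into individual characters instead.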
The dimension error is resolved as well.
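A likely contributor to the dimension error in the question is the second `mlb.fit_transform(y_test)` call: it refits the binarizer on the test labels only, so its class count can differ from the number of columns the classifier was trained on. The sketch below fits the binarizer once and reuses it with `transform`. The tiny training set and label lists are hypothetical, and the pipeline simply mirrors the one in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical data: one list of labels per document.
X_train = ["i will pay now", "are you the finance guy?",
           "lawyers and law", "bank reported annual share"]
y_train = [["pay"], ["finance"], ["law"], ["finance", "report"]]
X_test = ["the finance guy reported", "law and lawyers"]
y_test = [["finance"], ["law"]]

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)  # learns the label set from training data
Y_test = mlb.transform(y_test)        # reuses the same columns, no refit

# Refitting on the test labels alone gives a narrower matrix (the mismatch).
Y_refit = MultiLabelBinarizer().fit_transform(y_test)
print(Y_train.shape, Y_test.shape, Y_refit.shape)

classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(LinearSVC())),
])
classifier.fit(X_train, Y_train)

predicted = classifier.predict(X_test)
# The column count matches mlb.classes_, so inverse_transform accepts it.
labels = mlb.inverse_transform(predicted)
print(labels)
```

With so few training rows the predicted labels are not meaningful. The point is only that the shapes of `Y_test` and `predicted` agree with `mlb.classes_`.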