How to use sklearn's MultiOutputClassifier with multi-label text classification

Time: 2018-08-11 18:50:11

Tags: python scikit-learn classification text-classification

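I am trying to do multi-output, multi-label, multi-class text classification. The example below works, but I know I am not using MultiOutputClassifier correctly. As I understand it, the point of MultiOutputClassifier is that you only need to call fit once, even when there are several outputs. How can I do this in a single pass over the data?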

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multioutput import MultiOutputClassifier

X_train = np.array(["new york is a really big city",
                    "new york was originally dutch",
                    "the big apple is huge",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "edinburgh is a small city in great britain",
                    "a northern city in great britain is edinburgh",
                    "edinburgh is in the uk",
                    "edinburgh is in england",
                    "edinburgh is in great britain",
                    "edinburgh is not big",
                    "edinburgh hosts the holyrood palace and new york hosts the empire state building",
                    "nyc is big and edinburgh is smaller",
                    "i like edinburgh better than new york"])
y_train_text_1 = [["new york"],["new york"],["new york"],["new york"],["new york"],
                ["new york"],["edinburgh"],["edinburgh"],["edinburgh"],["edinburgh"],
                ["edinburgh"],["edinburgh"],["new york","edinburgh"],["edinburgh"],["new york","edinburgh"]]
y_train_text_2 = [["big"],[""],["big"],["big"],[""],
                [""],["small"],[""],[""],[""],
                [""],["big"],[""],["big","small"],[""]]

X_test = np.array(['nice day in nyc',
                   'my big day in edinburgh',
                   'edinburgh is small but nyc is big',
                   'it is raining in britain',
                   'it is raining in britain and the big apple',
                   'it is raining in britain and nyc',
                   'hello welcome to new york. enjoy it here and edinburgh too'])

mlb_1 = MultiLabelBinarizer()
Y_1 = mlb_1.fit_transform(y_train_text_1)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(OneVsRestClassifier(LinearSVC())))])

classifier.fit(X_train, Y_1)
predicted = classifier.predict(X_test)
all_labels = mlb_1.inverse_transform(predicted)

print('city name classifier:')
for item, labels in zip(X_test, all_labels):
    print('{0} => {1}'.format(item, ', '.join(labels)))

# Now fit on second output - can I do it all at once?
mlb_2 = MultiLabelBinarizer()
Y_2 = mlb_2.fit_transform(y_train_text_2)
classifier.fit(X_train, Y_2)
predicted = classifier.predict(X_test)
all_labels = mlb_2.inverse_transform(predicted)

print('city size classifier:')
for item, labels in zip(X_test, all_labels):
    print('{0} => {1}'.format(item, ', '.join(labels)))

Output from running it:

city name classifier:
nice day in nyc => new york
my big day in edinburgh => edinburgh
edinburgh is small but nyc is big => edinburgh
it is raining in britain => edinburgh
it is raining in britain and the big apple => edinburgh
it is raining in britain and nyc => edinburgh
hello welcome to new york. enjoy it here and edinburgh too => edinburgh, new york
city size classifier:
nice day in nyc => 
my big day in edinburgh => 
edinburgh is small but nyc is big => big, small
it is raining in britain => 
it is raining in britain and the big apple => big
it is raining in britain and nyc => 
hello welcome to new york. enjoy it here and edinburgh too => 
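
One possible way to get both outputs from a single fit, sketched under some assumptions: binarize each label set with its own MultiLabelBinarizer, stack the two indicator matrices side by side with np.hstack, fit once on the combined matrix, then split the predicted columns back apart. The toy data below is a shortened, hypothetical variant of the question's data, and it uses empty lists instead of [""] so that "no label" is not treated as a class of its own. The names y_city, y_size and Y_all are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy data, shortened from the question's example.
X_train = ["new york is a really big city",
           "the big apple is huge",
           "nyc is nice",
           "edinburgh is a small city in great britain",
           "edinburgh is in the uk",
           "nyc is big and edinburgh is smaller"]
y_city = [["new york"], ["new york"], ["new york"],
          ["edinburgh"], ["edinburgh"], ["new york", "edinburgh"]]
# Empty list means "no size label" (assumption: replaces the question's [""]).
y_size = [["big"], ["big"], [], ["small"], [], ["big", "small"]]

X_test = ["nice day in nyc",
          "edinburgh is small but nyc is big"]

# One binarizer per output, so each can be inverted separately later.
mlb_city = MultiLabelBinarizer()
mlb_size = MultiLabelBinarizer()
Y_city = mlb_city.fit_transform(y_city)
Y_size = mlb_size.fit_transform(y_size)

# Combined indicator matrix: city columns first, then size columns.
Y_all = np.hstack([Y_city, Y_size])

classifier = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LinearSVC())),
])

# A single fit call; the text is vectorized only once.
classifier.fit(X_train, Y_all)
predicted = classifier.predict(X_test)

# Split the predicted columns back into the two outputs.
n_city = len(mlb_city.classes_)
city_labels = mlb_city.inverse_transform(predicted[:, :n_city])
size_labels = mlb_size.inverse_transform(predicted[:, n_city:])

for item, cities, sizes in zip(X_test, city_labels, size_labels):
    print("{0} => city: {1} | size: {2}".format(
        item, ", ".join(cities), ", ".join(sizes)))
```

Note that this still trains one binary classifier per label column internally; what is shared is the vectorization step and the single fit/predict call. Swapping in MultiOutputClassifier(LinearSVC()) for the "clf" step should behave similarly on a binary indicator matrix like this one, since it also fits one estimator per column.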

0 answers:

No answers