Classifier sometimes does not assign a document to any category

Asked: 2017-05-15 13:02:10

Tags: python scikit-learn classification document-classification

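I am currently trying to classify some documents into a fixed number of categories. The main problem is that the classifier sometimes does not seem to find a suitable category, so the output for that document is empty.

I use the following code: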

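from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Turn the lists of label names into a binary indicator matrix.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

class DenseTransformer(BaseEstimator, TransformerMixin):
    """Convert the sparse tf-idf matrix to a dense array, which GaussianNB needs."""

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        # toarray() returns an ndarray; todense() returns an np.matrix,
        # which newer scikit-learn releases no longer accept.
        return X.toarray()

classifier = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words='english')),
    ('to_dense', DenseTransformer()),
    ('clf', OneVsRestClassifier(GaussianNB()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = mlb.inverse_transform(predicted)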

An example of the output:

doc 0 : ""
doc 1 : "news"
doc 2 : "spam"
doc 3 : ""
doc 4 : ""
doc 5 : "news"
doc 6 : "tech-news"
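
For context, an empty string here corresponds to a row of all zeros in `predicted`. With `OneVsRestClassifier` each label gets its own binary classifier, and when none of them fires, `inverse_transform` returns an empty tuple for that document. A minimal sketch, using a hand-written (hypothetical) prediction matrix rather than real classifier output:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mlb.fit([["news"], ["spam"], ["tech-news"]])

# Hypothetical predictions: one row per document, one column per label,
# in the order given by mlb.classes_ (sorted label names).
predicted = np.array([[0, 0, 0],   # no binary classifier said "yes"
                      [1, 0, 0],   # only the first label
                      [0, 1, 1]])  # second and third labels

labels = mlb.inverse_transform(predicted)
for i, doc_labels in enumerate(labels):
    print("doc", i, ":", ", ".join(doc_labels))
```

The first row maps to an empty tuple, which is what shows up as `""` in the output above.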

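Isn't the idea to assign a category to every document by comparing similarities computed from tf-idf? (tf-idf weights each word by how often it occurs in a document, scaled down by how common the word is across all documents.)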
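
To make the tf-idf part concrete, here is a small sketch of what `TfidfVectorizer` produces. The three documents are illustrative only, chosen in the style of the sample data; stop words such as "is" and "in" are dropped, and each row is L2-normalized by default:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative mini-corpus (not the full training set).
docs = ["london is in england", "london is in the uk", "nyc is nice"]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)

# Vocabulary terms ordered by their column index in X.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

for doc, row in zip(docs, X.toarray()):
    weights = {t: round(float(w), 2) for t, w in zip(terms, row) if w > 0}
    print(doc, "=>", weights)

# Inverse document frequency per term: rarer terms get a larger value.
idf = dict(zip(terms, vec.idf_))
```

A test document whose words all fall outside this vocabulary is mapped to an all-zero vector, so the classifier has nothing to compare against for that document.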

Edit: sample code

import numpy as np

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york",
                    "I love fruits mate",
                    "I usually eat apples",
                    "we should go for bananas or other fruits"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
                ["new york"],["london"],["london"],["london"],["london"],
                ["london"],["london"],["new york","london"],["new york","london"],["Fruits"],["Fruits"],["Fruits"]]

X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'london is rainy',
                   'it is raining in britian',
                   'it is raining in britian and the big apple',
                   'it is raining in britian and nyc',
                   'hello welcome to new york. enjoy it here and london too',
                   'how about fruits like apples or something today ?',
                   'shall we go for apples ?'])
target_names = ['New York', 'London','Fruits']

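from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

# Binarize the label lists defined above.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(SVC(kernel="linear", decision_function_shape='ovo')))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = mlb.inverse_transform(predicted)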

This sample code gives the following output:

nice day in nyc => new york
welcome to london => london
london is rainy => london
it is raining in britian => 
it is raining in britian and the big apple => new york
it is raining in britian and nyc => new york
hello welcome to new york. enjoy it here and london too => london, new york
how about fruits like apples or something today ? => Fruits
shall we go for apples ? => Fruits
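
In this output, "it is raining in britian" gets no label: after stop-word removal, none of its words occur in the training vocabulary. If every document must end up with at least one label, one possible workaround is to fall back to the label with the highest decision score whenever `predict` returns an all-zero row. The sketch below assumes a reduced training set, and `predict_with_fallback` is a hypothetical helper, not part of scikit-learn:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

def predict_with_fallback(pipeline, X):
    """Hypothetical helper: like predict(), but never returns an empty row."""
    predicted = pipeline.predict(X).copy()
    scores = pipeline.decision_function(X)  # shape (n_samples, n_labels)
    # Rows where no binary classifier predicted its label.
    rows = np.flatnonzero(predicted.sum(axis=1) == 0)
    # For those rows, switch on the label with the highest score.
    predicted[rows, scores[rows].argmax(axis=1)] = 1
    return predicted

# Reduced training set, for illustration only.
X_train = ["new york is a hell of a town", "nyc is nice",
           "london is in england", "it rains a lot in london",
           "I love fruits mate", "I usually eat apples"]
y_train_text = [["new york"], ["new york"], ["london"], ["london"],
                ["Fruits"], ["Fruits"]]
X_test = ["it is raining in britian", "welcome to london"]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

classifier = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", OneVsRestClassifier(SVC(kernel="linear")))])
classifier.fit(X_train, Y)

labels = mlb.inverse_transform(predict_with_fallback(classifier, X_test))
for doc, doc_labels in zip(X_test, labels):
    print(doc, "=>", ", ".join(doc_labels))
```

Note that the fallback forces a guess even for documents that share no words with the training data, so whether that beats an empty result depends on the application.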

0 answers:

No answers yet.