Text classification with scikit-learn / Python

Asked: 2018-01-24 09:38:45

Tags: python scikit-learn

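I am using scikit-learn for text classification. I use a Naive Bayes classifier to assign unstructured text (the Details column in the dataset below) to one of a set of labelled targets (the Category column). I can get the accuracy on the test data, but can someone tell me how to print the category that each unstructured text (each entry of the Details column) is predicted to belong to?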

Here is what my sample dataset looks like.

Details                                     |Category
-------------------------------------------------------------                                
Tanishq Jwellery Bangalore                  |jwellery
ODESK***BAL-28APR13                         |Others
AEGON RELIGARE LIFE IN                      |Others
INTERNET PAYMENT #999999                    |Transfer in for Card Payment
WWW.VISTAPRINT.IN                           |Others
Khazana Jwellery                            |jwellery
INTERNET PAYMENT #999999                    |Transfer in for Card Payment
Indian Oil                                  |Fuel
Touch foot wear                             |Clothing

Here is part of my code:

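import pandas as pd
import numpy as np
import scipy as sp

# train_test_split lives in sklearn.model_selection; the older
# sklearn.cross_validation module was removed in scikit-learn 0.20.
from sklearn.model_selection import train_test_split

u_cols = ['Details', 'Category']
k = pd.read_csv('mydatset.csv', delimiter='\t', usecols=u_cols)
# Note: this slice starts at row 1, so the first data row is skipped.
data = k[1:1000]
target_one = data['Category']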

def train(classifier, X, y):
    # Hold out half of the rows for testing, splitting the X and y passed in.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.50, random_state=33)

    classifier.fit(X_train, y_train)
    print("Accuracy: %s" % classifier.score(X_test, y_test))
    return classifier

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

trial1 = Pipeline([('vectorizer', TfidfVectorizer()),
                 ('classifier', MultinomialNB())])

train(trial1, data.Details.values.astype('U'), target_one)

1 answer:

Answer 0 (score: 0)

You need to store the returned classifier in a variable and then call predict() on it, like this:

trained_clf = train(trial1, data.Details.values.astype('U'), target_one)

predictions = trained_clf.predict(unstructured_data.Details.values.astype('U'))
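
To see which category each text was assigned, the predictions can be placed next to the inputs. The following is only a minimal sketch, not the asker's actual setup: the training rows are copied from the sample dataset in the question, while the contents of `unstructured_data` and the `Predicted` column name are made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Training rows copied from the sample dataset in the question.
train_df = pd.DataFrame({
    "Details": [
        "Tanishq Jwellery Bangalore",
        "ODESK***BAL-28APR13",
        "AEGON RELIGARE LIFE IN",
        "INTERNET PAYMENT #999999",
        "WWW.VISTAPRINT.IN",
        "Khazana Jwellery",
        "INTERNET PAYMENT #999999",
        "Indian Oil",
        "Touch foot wear",
    ],
    "Category": [
        "jwellery",
        "Others",
        "Others",
        "Transfer in for Card Payment",
        "Others",
        "jwellery",
        "Transfer in for Card Payment",
        "Fuel",
        "Clothing",
    ],
})

# Same pipeline as in the question: TF-IDF features into multinomial Naive Bayes.
clf = Pipeline([("vectorizer", TfidfVectorizer()),
                ("classifier", MultinomialNB())])
clf.fit(train_df["Details"], train_df["Category"])

# Hypothetical unlabelled rows standing in for `unstructured_data`.
unstructured_data = pd.DataFrame({
    "Details": ["Khazana Jwellery", "INTERNET PAYMENT #123456", "Indian Oil"],
})

# predict() returns one label per input row, in the same order as the input.
predictions = clf.predict(unstructured_data["Details"].values.astype("U"))

# Put each text next to its predicted category and print the table.
result = unstructured_data.assign(Predicted=predictions)
print(result.to_string(index=False))
```

Because predict() keeps the input order, attaching its output as a new column is enough to line up each text with its category. With so few training rows the predicted labels themselves should not be expected to be reliable.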

Update: if you mean the test data that the accuracy is computed on inside the train() function, you can print the predicted categories like this:

def train(classifier, X, y):
    ...
    ...
    print("Accuracy: %s" % classifier.score(X_test, y_test))
    print("Predictions: %s" % classifier.predict(X_test))

    return classifier
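
Printing the raw prediction array does not show which text each label belongs to. One possible variation, sketched below under assumptions, pairs every held-out text with its predicted and actual category. The function name `train_and_report` and the column names of the report are hypothetical, and the data is again the nine sample rows from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Sample rows from the question, used here as a tiny stand-in dataset.
details = [
    "Tanishq Jwellery Bangalore", "ODESK***BAL-28APR13",
    "AEGON RELIGARE LIFE IN", "INTERNET PAYMENT #999999",
    "WWW.VISTAPRINT.IN", "Khazana Jwellery",
    "INTERNET PAYMENT #999999", "Indian Oil", "Touch foot wear",
]
categories = [
    "jwellery", "Others", "Others", "Transfer in for Card Payment",
    "Others", "jwellery", "Transfer in for Card Payment", "Fuel", "Clothing",
]


def train_and_report(classifier, X, y):
    """Hypothetical variant of train(): fit, score, and list per-row predictions."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.50, random_state=33)

    classifier.fit(X_train, y_train)
    print("Accuracy: %s" % classifier.score(X_test, y_test))

    # One row per held-out text: the text, the predicted label, the true label.
    report = pd.DataFrame({
        "Details": X_test,
        "Predicted": classifier.predict(X_test),
        "Actual": y_test,
    })
    print(report.to_string(index=False))
    return classifier, report


trial1 = Pipeline([("vectorizer", TfidfVectorizer()),
                   ("classifier", MultinomialNB())])
trained_clf, report = train_and_report(trial1, details, categories)
```

Returning the report together with the classifier is a design choice made for this sketch only; the original train() returns just the classifier, and the table could equally be built outside the function from the classifier and the held-out split.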