Classifying text with NaiveBayesClassifier

Time: 2018-05-29 07:08:11

Tags: python-3.x machine-learning scikit-learn nlp nltk

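I have a text file with one sentence per line, for example: "Did you register your email ID in your bank account?"

I want to classify each sentence as interrogative or non-interrogative. FYI, these are sentences from a bank website. I have seen this answer, which uses the following nltk code block: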

import nltk
nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()[:10000]


def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

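So I did some preprocessing on my text file (removing certain words and so on), reducing each sentence to a bag of words. From the code above, I have a trained classifier. How do I apply it to my text file of sentences (raw or preprocessed)?

Update: here is a sample of my text file.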

2 Answers:

Answer 0 (score: 1)

Assuming you have preprocessed your document data as we discussed, you can do the following:

import nltk
nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()[:10000]


def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]

# Train on train_set only, so the accuracy below is measured on held-out posts
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.668
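Note that this accuracy is measured over all dialogue-act classes, while the question only needs a question / non-question decision. A minimal sketch of collapsing the labels to two classes before scoring follows. It assumes the NPS Chat question classes are named `whQuestion` and `ynQuestion` (worth confirming against your own `posts`); `to_binary` and `binary_accuracy` are hypothetical helper names, and the label lists are toy data, not real classifier output.

```python
# Assumed names of the NPS Chat question classes; confirm them with
# sorted({post.get('class') for post in posts}) before relying on this set.
QUESTION_LABELS = {"whQuestion", "ynQuestion"}


def to_binary(label):
    """Collapse a dialogue-act label to 'question' or 'non-question'."""
    return "question" if label in QUESTION_LABELS else "non-question"


def binary_accuracy(gold_labels, predicted_labels):
    """Fraction of items whose collapsed gold and predicted labels agree."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted label lists differ in length")
    if not gold_labels:
        return 0.0
    hits = sum(
        to_binary(gold) == to_binary(pred)
        for gold, pred in zip(gold_labels, predicted_labels)
    )
    return hits / len(gold_labels)


# Toy labels for illustration only (hypothetical, not real model output)
gold = ["ynQuestion", "Statement", "whQuestion", "Greet"]
pred = ["whQuestion", "Emotion", "Statement", "Greet"]
score = binary_accuracy(gold, pred)
print(score)
```

With the real classifier, the gold labels could be taken as `[label for _, label in test_set]` and the predictions from `classifier.classify_many([feats for feats, _ in test_set])`.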

For your data, you can iterate over your lines, fit the classifier, and predict:

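# Retrain on all labelled posts, then classify each sentence from your file
classifier = nltk.NaiveBayesClassifier.train(featuresets)
for line in lines:  # `lines`: your sentences, one string each (assumed name)
    print(classifier.classify(dialogue_act_features(line)))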
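The classifier returns a dialogue-act label rather than a yes/no answer, so one more step is needed to get the interrogative / non-interrogative split the question asks for. The sketch below is one possible way to do that. `label_sentences` is a hypothetical helper, the question class names `whQuestion` and `ynQuestion` are an assumption to check against the corpus, and `demo_featurize` / `demo_classify` are toy stand-ins so the example runs without NLTK.

```python
# Assumed names of the NPS Chat question classes (verify against the corpus)
QUESTION_LABELS = {"whQuestion", "ynQuestion"}


def label_sentences(lines, featurize, classify):
    """Return (sentence, label, is_question) for every non-empty line.

    featurize: maps a sentence to a feature dict (e.g. dialogue_act_features)
    classify:  maps a feature dict to a label (e.g. classifier.classify)
    """
    results = []
    for line in lines:
        sentence = line.strip()
        if not sentence:
            continue  # skip blank lines
        label = classify(featurize(sentence))
        results.append((sentence, label, label in QUESTION_LABELS))
    return results


# Toy stand-ins so the sketch runs on its own (not the real model)
def demo_featurize(sentence):
    return {"contains({})".format(w): True for w in sentence.lower().split()}


def demo_classify(features):
    has_question_mark = any(key.endswith("?)") for key in features)
    return "ynQuestion" if has_question_mark else "Statement"


demo_lines = [
    "Did you register your email ID in your bank account?\n",
    "\n",
    "Your statement is ready.\n",
]
results = label_sentences(demo_lines, demo_featurize, demo_classify)
for sentence, label, is_question in results:
    print(is_question, label, sentence)
```

With the trained model, the call would look like `label_sentences(open_file, dialogue_act_features, classifier.classify)`, where `open_file` is your text file opened for reading.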

Answer 1 (score: 0)

Do this for every line in the text file:

classifier = nltk.NaiveBayesClassifier.train(featuresets)
with open('sentences.txt') as f:  # hypothetical path to your text file
    for line in f:
        print(classifier.classify(dialogue_act_features(line.strip())))
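On the "raw or preprocessed" part of the question: the classifier was trained on `contains(word)` features taken from unprocessed chat posts, so it is probably safer to feed it the raw sentences. The sketch below illustrates why, under stated assumptions: `tokenize` is a regex stand-in for `nltk.word_tokenize`, `STOP_WORDS` is a tiny hypothetical list, and the "preprocessing" is simulated. It shows which features disappear once stop words and punctuation are removed.

```python
import re

# Hypothetical, tiny stop-word list for illustration only
STOP_WORDS = {"did", "you", "your", "in", "the", "is", "do"}


def tokenize(text):
    """Stand-in for nltk.word_tokenize: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def contains_features(tokens):
    """Same 'contains(word)' feature scheme as dialogue_act_features."""
    return {"contains({})".format(tok): True for tok in tokens}


raw = "Did you register your email ID in your bank account?"
raw_tokens = tokenize(raw)

# Simulated preprocessing: drop punctuation and stop words
cleaned_tokens = [t for t in raw_tokens if t.isalnum() and t not in STOP_WORDS]

raw_features = contains_features(raw_tokens)
cleaned_features = contains_features(cleaned_tokens)

# Features the classifier would no longer see after preprocessing
lost = sorted(set(raw_features) - set(cleaned_features))
print(lost)
```

The lost features include `contains(?)` and `contains(did)`, which are the kind of cues a question detector is likely to depend on.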