Question

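I have a set of documents that I cleaned in R using the tm package. In the end, they were converted to a data frame and saved as .txt files with the write.table function.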

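I then wrote Python code, using sklearn and nltk, that classifies these documents with a pipeline built around a naive Bayes classifier (the choice of classifier is not important) and dumps the predictions into a data frame. The classification works. However, when I retrieve the predicted labels and predicted probabilities (using sklearn's predict_proba), I get two sets of probabilities for each document.

I believe this happens because the files processed in R each have a "doc_id" associated with them, and the Python classification code I wrote tries to classify the text in the "doc_id" as well as the text in the actual file. So when a single document is classified into 3 categories, I get back an ndarray of shape [2, 3] instead of one of shape [1, 3]. I have read the documentation for tm, sklearn, and nltk, and every Stack Exchange post my search terms turned up, but I cannot figure out how to: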

1) remove the doc_id from the R files
2) have sklearn's model.predict and model.predict_proba classify only the text, not the doc_id.

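Any help you can provide would be great. This seems like it should be a simple fix, but none of the solutions I have tried so far has worked. That includes using CategorizedPlaintextCorpusReader in Python to extract the raw text with reader.raw, because model.predict does not accept a series of strings, which is the kind of object reader.raw produces.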
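
On the reader.raw point, a hedged note: reader.raw returns one Python string, while scikit-learn's text vectorizers expect an iterable of documents, so a bare string is typically rejected with a ValueError and a one-element list is accepted. The sketch below illustrates this with invented toy training texts (not the corpus from this question) and a plain TfidfVectorizer without the stemming tokenizer; `raw_text` is a hypothetical stand-in for the value reader.raw would return.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy training data, one document per category.
train_docs = [
    "trade cooperation institutions agreement",
    "power security military balance",
    "weather schedule lunch meeting",
]
train_labels = ["liberal", "realist", "neither"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

# Hypothetical stand-in for the single string returned by reader.raw(...).
raw_text = "security and military power"

# Wrapping the string in a list makes it exactly one document.
probs = model.predict_proba([raw_text])
print(probs.shape)     # (1, 3)
print(model.classes_)  # the column order of probs
```

Note that the columns of `probs` follow `model.classes_`, which scikit-learn sorts, so they are not necessarily in the order the labels were supplied.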

My Python code may be useful:

import nltk
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from nltk.stem import PorterStemmer
from nltk import word_tokenize
from nltk.corpus import stopwords
import os
import pandas as pd
import numpy as np

# One training document per category, read from files named file_<category>.txt
reader = CategorizedPlaintextCorpusReader('file', r'file_.*\.txt', cat_pattern=r'file_(\w+)\.txt')
liberaltext = reader.raw(fileids='file_liberal.txt')
realisttext = reader.raw(fileids='file_realist.txt')
atheoretictext = reader.raw(fileids='file_atheoretic.txt')
documents = [liberaltext, realisttext, atheoretictext]
categories = ['liberal', 'realist', 'neither']
# Results table, started with a placeholder row of zeros
listofpredprobs = pd.DataFrame({"liberal": [0], "realist": [0], "neither": [0]})

def stemming_tokenizer(text):
    """Tokenize the text and reduce each token to its Porter stem."""
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in word_tokenize(text)]

# TF-IDF features with stemming and English stop words, then naive Bayes
model = make_pipeline(TfidfVectorizer(tokenizer=stemming_tokenizer, stop_words=stopwords.words('english'), encoding='UTF-8'), MultinomialNB())
model.fit(documents, categories)

# Classify every file found under the 'file' directory
for root, dirs, files in os.walk('file'):
    for file in files:
        # The open file handle is passed to the pipeline as-is
        test = open(os.path.join(root, file), 'r')
        labels = model.predict(test)
        test = open(os.path.join(root, file), 'r')
        predprobs = model.predict_proba(test)
        # predict_proba orders its columns by model.classes_, not by `categories`
        dfpp = pd.DataFrame(predprobs, columns=model.classes_)
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        listofpredprobs = pd.concat([listofpredprobs, dfpp])
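
A possible explanation for the [2, 3] shape, offered as a sketch rather than a confirmed diagnosis: the loop above passes an open file handle to the pipeline, and iterating over a file handle yields one item per line. If write.table wrote a header line (containing doc_id) followed by one line of text, the handle would supply two "documents". The snippet below uses io.StringIO as a stand-in for such a file; the sample contents are invented and the actual layout of the saved files is an assumption.

```python
import io

# Invented stand-in for one saved file: a header line, then one row of text.
sample = (
    '"doc_id" "text"\n'
    '"1" "file_1.txt" "trade agreement signed between the two states"\n'
)

# Iterating over a file-like object yields one item per line.
handle = io.StringIO(sample)
as_lines = list(handle)

# Reading the whole file and wrapping it in a list gives one document.
handle = io.StringIO(sample)
as_single_document = [handle.read()]

print(len(as_lines))            # 2
print(len(as_single_document))  # 1
```

With the real pipeline this would correspond to calling model.predict_proba([test.read()]) rather than model.predict_proba(test), although the header and doc_id text would then still be part of the classified string.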

Answer 1

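Well, to solve this I ended up reprocessing the documents for classification using Python. That did the trick! It took a few hours to process them, but it is done now. Thanks for your help.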
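
For anyone who would rather not reprocess the corpus, a rough alternative is to parse the write.table output and keep only the text column. This is a minimal sketch that assumes write.table was called with its defaults (space-separated, double-quoted fields, and a header one field shorter than the rows because the row names have no column name) and that the column is named "text". The helper `extract_text_column` and the sample contents are hypothetical, and text containing escaped quotes is not handled.

```python
import csv
import io

def extract_text_column(handle, text_column="text"):
    """Return the values of `text_column` from write.table-style output.

    Assumes space-separated, double-quoted fields with a header row.
    """
    rows = csv.reader(handle, delimiter=" ", quotechar='"')
    header = next(rows)
    texts = []
    for row in rows:
        if not row:
            continue  # skip blank lines
        # Rows longer than the header carry leading row names; skip them.
        offset = len(row) - len(header)
        texts.append(row[offset + header.index(text_column)])
    return texts

# Invented sample in the assumed layout: row name, doc_id, text.
sample = io.StringIO(
    '"doc_id" "text"\n'
    '"1" "file_1.txt" "trade agreement signed between the two states"\n'
)
documents = extract_text_column(sample)
print(documents)
```

The resulting list holds one string per document, so passing it to model.predict or model.predict_proba should give one row per document, without the doc_id.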

Classifying an R corpus using NLTK and sklearn
