Combining LSA / LSI with Naive Bayes for document classification

Time: 2015-04-29 01:23:12

Tags: document-classification gensim naivebayes latent-semantic-indexing latent-semantic-analysis

I am new to the gensim package and to vector space models in general, and I am not sure what exactly I should do with my LSA output.

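To give a brief overview of my goal: I want to use topic modeling to enhance a Naive Bayes classifier and improve the classification of reviews (positive or negative). Here is a great paper I have been reading; it has shaped my ideas, but it has left me somewhat confused about the implementation.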

I already have working code for Naive Bayes. Currently I am just using a unigram bag of words as my features, and the labels are either positive or negative.

Here is my gensim code

from pprint import pprint # pretty printer
import gensim as gs

# tutorial sample documents
docs = ["Human machine interface for lab abc computer applications",
        "A survey of user opinion of computer system response time",
        "The EPS user interface management system",
        "System and human system engineering testing of EPS",
        "Relation of user perceived response time to error measurement",
        "The generation of random binary unordered trees",
        "The intersection graph of paths in trees",
        "Graph minors IV Widths of trees and well quasi ordering",
        "Graph minors A survey"]


# stoplist removal, tokenization
stoplist = set('for a of the and to in'.split())
# for each document: lowercase document, split by whitespace, and add all its words not in stoplist to texts
texts = [[word for word in doc.lower().split() if word not in stoplist] for doc in docs]


# create dict
dict = gs.corpora.Dictionary(texts)
# create corpus
corpus = [dict.doc2bow(text) for text in texts]

# tf-idf
tfidf = gs.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

# latent semantic indexing with 10 topics
lsi = gs.models.LsiModel(corpus_tfidf, id2word=dict, num_topics=10)

# print() with parentheses works on both Python 2 and Python 3
for i in lsi.print_topics():
    print(i)

Here is the output

0.400*"system" + 0.318*"survey" + 0.290*"user" + 0.274*"eps" + 0.236*"management" + 0.236*"opinion" + 0.235*"response" + 0.235*"time" + 0.224*"interface" + 0.224*"computer"
0.421*"minors" + 0.420*"graph" + 0.293*"survey" + 0.239*"trees" + 0.226*"paths" + 0.226*"intersection" + -0.204*"system" + -0.196*"eps" + 0.189*"widths" + 0.189*"quasi"
-0.318*"time" + -0.318*"response" + -0.261*"error" + -0.261*"measurement" + -0.261*"perceived" + -0.261*"relation" + 0.248*"eps" + -0.203*"opinion" + 0.195*"human" + 0.190*"testing"
0.416*"random" + 0.416*"binary" + 0.416*"generation" + 0.416*"unordered" + 0.256*"trees" + -0.225*"minors" + -0.177*"survey" + 0.161*"paths" + 0.161*"intersection" + 0.119*"error"
-0.398*"abc" + -0.398*"lab" + -0.398*"machine" + -0.398*"applications" + -0.301*"computer" + 0.242*"system" + 0.237*"eps" + 0.180*"testing" + 0.180*"engineering" + 0.166*"management"

Any suggestions or general comments would be appreciated.

1 Answer:

Answer 0 (score: 0)

I have just started working on the same problem, but with an SVM instead. AFAIK, after training your model you need to do something like this:

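new_text = 'here is some document'
# doc2bow expects a list of tokens, so tokenize the same way as the training texts
text_bow = dict.doc2bow(new_text.lower().split())
# the LSI model was trained on tf-idf vectors, so apply tfidf before lsi
vector = lsi[tfidf[text_bow]]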

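The vector is the topic distribution of the document. Its length equals the number of topics you chose for training, which is 10 in your case. So you need to represent all of your documents as topic distributions and then feed them to the classification algorithm.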
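
To make that last step a bit more concrete, here is a minimal sketch of what "feed them to the classification algorithm" could look like. It rests on a few assumptions. gensim returns each LSI vector as a sparse list of (topic_id, weight) pairs and may omit near-zero topics, so the sketch first expands every vector to a fixed length. LSI weights can be negative, so a count-based (multinomial) Naive Bayes does not fit them directly; I swapped in a tiny Gaussian Naive Bayes as one possible choice. The names `to_dense`, `GaussianNB`, `train_sparse` and `NUM_TOPICS` are hypothetical, and the numbers are made up for illustration rather than taken from the corpus in the question.

```python
import math


def to_dense(sparse_vec, num_topics):
    """Expand a sparse [(topic_id, weight), ...] list into a fixed-length list."""
    dense = [0.0] * num_topics
    for topic_id, weight in sparse_vec:
        dense[topic_id] = float(weight)
    return dense


class GaussianNB:
    """Tiny Gaussian Naive Bayes for real-valued features (illustrative only)."""

    def fit(self, X, y, var_floor=1e-3):
        # Per class: log prior, per-feature mean, per-feature variance.
        # var_floor keeps variances away from zero on tiny training sets.
        self.stats = {}
        n = len(X)
        for label in sorted(set(y)):
            rows = [x for x, lab in zip(X, y) if lab == label]
            cols = list(zip(*rows))
            means = [sum(col) / len(rows) for col in cols]
            variances = [
                sum((v - m) ** 2 for v in col) / len(rows) + var_floor
                for col, m in zip(cols, means)
            ]
            self.stats[label] = (math.log(len(rows) / n), means, variances)
        return self

    def predict(self, x):
        def log_posterior(label):
            log_prior, means, variances = self.stats[label]
            return log_prior + sum(
                -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
                for v, m, var in zip(x, means, variances)
            )

        # Pick the class with the highest (unnormalized) log posterior.
        return max(self.stats, key=log_posterior)


# Made-up sparse topic vectors, shaped like the output of lsi[tfidf[bow]].
NUM_TOPICS = 3
train_sparse = [
    [(0, 0.62), (1, 0.10)],
    [(0, 0.55), (1, 0.05), (2, 0.02)],
    [(0, 0.08), (1, -0.58)],
    [(0, 0.12), (1, -0.49), (2, 0.03)],
]
train_labels = ["pos", "pos", "neg", "neg"]

X = [to_dense(v, NUM_TOPICS) for v in train_sparse]
clf = GaussianNB().fit(X, train_labels)

new_vec = to_dense([(0, 0.58), (1, 0.07)], NUM_TOPICS)
print(clf.predict(new_vec))
```

With real data you would build `train_sparse` by running every training document through the same dictionary, tf-idf and LSI pipeline, and in practice you would probably hand the dense vectors to a library classifier rather than a hand-rolled one.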

P.S. I know this is an old question, but I see it in the Google results every time I search.