I hope you can bear with my question; if anything is unclear, please tell me. I have tried to include plenty of detail, but it may still be confusing, and if so, please let me know.
We have LDA topic modeling, whose purpose is to generate a number of topics from a given set of documents, so that each document can belong to several topics. We can also evaluate the model we have created; one way to do this is with a classification method such as SVM. My goal is to evaluate the created model.
I have come across two kinds of code for building an LDA model.
1
# Generate the LDA model.
id2word = corpora.Dictionary(texts)
# Create the bag-of-words corpus.
mm = [id2word.doc2bow(text) for text in texts]
# Train the LDA model.
lda = ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10,
                        update_every=1, chunksize=10000, passes=1,
                        gamma_threshold=0.00, minimum_probability=0.00)
With this approach I cannot use fit_transform.
2
tf_vectorizer = CountVectorizer(max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda_x = lda.fit_transform(tf)
In the first approach, the LDA model has no fit_transform method; I do not understand why, or what the difference between the two approaches really is.
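From what I can tell, the closest equivalent in gensim is to apply the trained model to the corpus and densify the result. Here is a minimal sketch of what I mean (assuming the lda and mm objects from approach 1 above; gensim's matutils.corpus2dense builds the dense document-topic matrix):

from gensim import matutils

# lda[mm] yields, per document, a list of (topic_id, probability) pairs;
# corpus2dense stacks them into a num_topics x num_docs array, and the
# transpose gives the docs-by-topics matrix that fit_transform would return.
doc_topic = matutils.corpus2dense(lda[mm], num_terms=lda.num_topics).T
print(doc_topic.shape)  # (number_of_documents, 10)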
In any case, I need to pass the LDA model created with the first approach to the SVM (the reason I show both approaches here is that I know the second one runs without errors, probably because of fit_transform, but for certain reasons I cannot use it). This is my complete code:
import os
from gensim.models import ldamodel
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
tokenizer = RegexpTokenizer(r'\w+')
# Create the English stop-word list.
en_stop = {'a'}
# Create p_stemmer of class PorterStemmer.
p_stemmer = PorterStemmer()
lines = []
lisOfFiles = [x[2] for x in os.walk("data")]
fullPath = [x[0] for x in os.walk("data")]
for j in lisOfFiles[2]:
    with open(os.path.join(fullPath[2], j)) as f:
        a = f.read()
        lines.append(a)
for j in lisOfFiles[3]:
    with open(os.path.join(fullPath[3], j)) as f:
        a = f.read()
        lines.append(a)
for j in lisOfFiles[4]:
    with open(os.path.join(fullPath[4], j)) as f:
        a = f.read()
        lines.append(a)
# compile sample documents into a list
doc_set = lines
# list for tokenized documents in loop
texts = []
# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if i not in en_stop]
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    # add tokens to list
    texts.append(stemmed_tokens)
# Generate the LDA model.
id2word = corpora.Dictionary(texts)
# Create the bag-of-words corpus.
mm = [id2word.doc2bow(text) for text in texts]
# Train the LDA model.
lda = ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10,
                        update_every=1, chunksize=10000, passes=1,
                        gamma_threshold=0.00, minimum_probability=0.00)
# Assigns the topics to the documents in corpus
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
# Create the labels: for each document, keep every topic id whose
# probability exceeds 0.005 as one of that document's labels.
lda_corpus = lda[mm]
label_y = []
for i in lda_corpus:
    new_y = []
    for l in i:
        sorted_labels = sorted(i, key=lambda z: z[0], reverse=True)
        if l[1] > 0.005:
            new_y.append(l[0])
    label_y.append(new_y)
classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_df=2, min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(lda, label_y)
As you can see in my code, for certain reasons I used the first approach, but the last line raises an error (object of type 'int' has no len()). It seems the pipeline cannot accept an lda created this way (I suspect because I am not using fit_transform). How can I fix this error in my code?
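To illustrate where I think the mismatch is, here is a minimal sketch (the docs list is made up; I am relying on CountVectorizer's documented behavior of iterating over raw string documents):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["a first sample document", "a second sample document"]
vec = CountVectorizer()
X = vec.fit_transform(docs)  # works: fit_transform expects an iterable of strings
# Passing the gensim model instead, as in classifier.fit(lda, label_y),
# makes the vectorizer iterate over the LdaModel object itself; iteration
# falls back to lda[0], lda[1], ..., which is why the trace below ends in
# LdaModel.__getitem__ with TypeError: object of type 'int' has no len().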
Here is the full stack trace:
/home/saria/tfwithpython3.6/bin/python /home/saria/PycharmProjects/TfidfLDA/test4.py
Using TensorFlow backend.
Traceback (most recent call last):
  File "/home/saria/PycharmProjects/TfidfLDA/test4.py", line 92, in <module>
    classifier.fit(lda, label_y)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/pipeline.py", line 268, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/pipeline.py", line 234, in _fit
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 839, in fit_transform
    self.fixed_vocabulary_)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 760, in _count_vocab
    for doc in raw_documents:
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/gensim/models/ldamodel.py", line 1054, in __getitem__
    return self.get_document_topics(bow, eps, self.minimum_phi_value, self.per_word_topics)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/gensim/models/ldamodel.py", line 922, in get_document_topics
    gamma, phis = self.inference([bow], collect_sstats=per_word_topics)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/gensim/models/ldamodel.py", line 429, in inference
    if len(doc) > 0 and not isinstance(doc[0][0], six.integer_types + (np.integer,)):
TypeError: object of type 'int' has no len()
Process finished with exit code 1