Fitting the classifier: object of type 'int' has no len()

Asked: 2017-08-11 08:59:24

Tags: python svm text-classification lda topic-modeling

I hope you will bear with my question. I have tried to include plenty of detail, but it may still be unclear; if so, please let me know.

We have LDA topic modeling, whose purpose is to generate a number of topics from a given set of documents, so that each document can belong to several topics.

We can also evaluate the model we have created, and one way to do that is with a classification method such as SVM. My goal is to evaluate the created model.

I have come across two kinds of code for building an LDA model.

Approach 1 (gensim):

# generate LDA model
id2word = corpora.Dictionary(texts)

# Creates the bag-of-words corpus.
mm = [id2word.doc2bow(text) for text in texts]

# Trains the LDA model.
lda = ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10,
                        update_every=1, chunksize=10000, passes=1,
                        gamma_threshold=0.00, minimum_probability=0.00)

With this approach I cannot use fit_transform.
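For reference, the document-topic matrix that fit_transform would return can still be built by hand from the gensim model. A minimal sketch, assuming the lda and mm objects defined above (the variable name lda_x here is just for illustration):

import numpy as np

# Build a dense (n_documents, n_topics) matrix of topic probabilities.
# minimum_probability=0.00 above makes gensim report every topic per doc.
lda_x = np.zeros((len(mm), lda.num_topics))
for doc_idx, doc_bow in enumerate(mm):
    for topic_id, prob in lda.get_document_topics(doc_bow):
        lda_x[doc_idx, topic_id] = prob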

Approach 2 (scikit-learn):

# Build the term-frequency matrix for the documents.
tf_vectorizer = CountVectorizer(max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)

# Fit the LDA model and transform the documents into topic space.
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda_x = lda.fit_transform(tf)

In the first approach, the LDA model has no fit_transform method; I do not understand why, or what the difference between the two approaches is.
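By contrast, the lda_x returned by fit_transform in the second approach is already such a matrix, with one topic-distribution row per document, so it can feed a classifier directly. A minimal sketch, assuming lda_x from approach 2 and a hypothetical per-document label vector y:

from sklearn.svm import LinearSVC

# Each row of lda_x (shape: n_documents x n_topics) is a topic
# distribution that can serve as a feature vector for the SVM.
# y is a hypothetical label vector, assumed to exist.
svm = LinearSVC()
svm.fit(lda_x, y)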

In any case, I need to pass the LDA model created with the first approach to the SVM (the reason I show both approaches here is that I know the second one raises no error, probably thanks to fit_transform, but for other reasons I cannot use it). This is my final code:

import os
from gensim.models import ldamodel
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC


tokenizer = RegexpTokenizer(r'\w+')

# create a (minimal) English stop word set
en_stop = {'a'}

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

# collect the raw text of every file under "data"
lines = []
lisOfFiles = [x[2] for x in os.walk("data")]
fullPath = [x[0] for x in os.walk("data")]

for idx in (2, 3, 4):
    for j in lisOfFiles[idx]:
        with open(os.path.join(fullPath[idx], j)) as f:
            lines.append(f.read())

# compile sample documents into a list
doc_set = lines
# list for tokenized documents in loop
texts = []

# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if i not in en_stop]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]

    # add tokens to list
    texts.append(stemmed_tokens)

# generate LDA model
id2word = corpora.Dictionary(texts)

# Creates the bag-of-words corpus.
mm = [id2word.doc2bow(text) for text in texts]

# Trains the LDA model.
lda = ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10,
                        update_every=1, chunksize=10000, passes=1,
                        gamma_threshold=0.00, minimum_probability=0.00)

# Rebuild the dictionary and convert the tokenized documents into a
# document-term matrix (these duplicate id2word and mm above).
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]


# Assign the topics to the documents in the corpus and create the
# labels: for each document, keep the ids of all topics whose
# probability exceeds the threshold.
lda_corpus = lda[mm]
label_y = []
for doc_topics in lda_corpus:
    new_y = []
    for topic_id, prob in doc_topics:
        if prob > 0.005:
            new_y.append(topic_id)
    label_y.append(new_y)
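Note that OneVsRestClassifier expects multi-label targets as a binary indicator matrix rather than lists of topic ids, so label_y would normally be binarized first. A minimal sketch, assuming label_y as built above (y_bin is my own name):

from sklearn.preprocessing import MultiLabelBinarizer

# Turn the per-document lists of topic ids into a binary indicator
# matrix with one column per topic.
mlb = MultiLabelBinarizer()
y_bin = mlb.fit_transform(label_y)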

classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_df=2, min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(lda, label_y)  # <-- raises the TypeError below

As you can see in my code, for certain reasons I used the first approach, but the last line raises an error (object of type 'int' has no len()). It seems the pipeline cannot accept an lda created this way (I suspect it is because I did not use fit_transform). How can I fix this error in my code?
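Judging from the trace below, the pipeline's CountVectorizer iterates over lda as if it were a list of raw documents, and each integer index that iteration produces is handed to gensim's __getitem__, hence the len() call on an int. One plausible workaround, not necessarily the intended design, assuming lines and the y_bin sketch above:

# Fit the pipeline on the raw document strings, which is what
# CountVectorizer expects, instead of the gensim model object.
classifier.fit(lines, y_bin)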

Thank you very much in advance for your patience and help.

This is the full stack trace:

/home/saria/tfwithpython3.6/bin/python /home/saria/PycharmProjects/TfidfLDA/test4.py
Using TensorFlow backend.
Traceback (most recent call last):
  File "/home/saria/PycharmProjects/TfidfLDA/test4.py", line 92, in <module>
    classifier.fit(lda, label_y)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/pipeline.py", line 268, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/pipeline.py", line 234, in _fit
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 839, in fit_transform
    self.fixed_vocabulary_)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 760, in _count_vocab
    for doc in raw_documents:
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/gensim/models/ldamodel.py", line 1054, in __getitem__
    return self.get_document_topics(bow, eps, self.minimum_phi_value, self.per_word_topics)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/gensim/models/ldamodel.py", line 922, in get_document_topics
    gamma, phis = self.inference([bow], collect_sstats=per_word_topics)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/gensim/models/ldamodel.py", line 429, in inference
    if len(doc) > 0 and not isinstance(doc[0][0], six.integer_types + (np.integer,)):
TypeError: object of type 'int' has no len()

Process finished with exit code 1

0 Answers:

No answers yet.