gensim memory-friendly corpus error

Asked: 2017-03-27 11:40:26

Tags: python-3.x gensim topic-modeling

I am using Python 3.5, and based on the gensim samples I created a project and added this code to it:

    class MyCorpus(object):
        def __iter__(self):
            for line in open('files/2/mycorpus.txt'):
                # assume there's one document per line, tokens separated by whitespace
                yield dictionary.doc2bow(line.lower().split())


    corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
    print(corpus_memory_friendly)

But after running it, I got this error in the PyCharm console:

    Traceback (most recent call last):
      File "D:/Python-Workspace(s)/GensimSamples/2.Gensim_CorpusStreaming.py", line 31, in <module>
        for vector in corpus_memory_friendly:  # load one vector into memory at a time
      File "D:/Python-Workspace(s)/GensimSamples/2.Gensim_CorpusStreaming.py", line 17, in __iter__
        yield dictionary.doc2bow(line.lower().split())
    AttributeError: module 'gensim.corpora.dictionary' has no attribute 'doc2bow'

How can I fix this?

1 answer:

Answer 0 (score: 0)

You just need to prepare the dictionary beforehand and make it available to the MyCorpus class. An example class that creates a memory-friendly corpus could be:

import logging
from pprint import pprint
from six import iteritems
from gensim import corpora

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


class MyCorpus(object):
    def __init__(self, text_file='text_corpus.txt', dictionary=None):
        """
        Checks if a dictionary has been given as a parameter.
        If no dictionary has been given, it creates one and saves it in the disk.
        """
        self.file_name = text_file
        if dictionary is None:
            self.prepare_dictionary()
        else:
            self.dictionary = dictionary

    def __iter__(self):
        for line in open(self.file_name):
            # assume there's one document per line, tokens separated by whitespace
            yield self.dictionary.doc2bow(line.lower().split())

    def prepare_dictionary(self):
        stop_list = set('for a of the and to in'.split())  # List of stop words which can also be loaded from a file.

        # Creating a dictionary from the stored text file, using the Dictionary class defined by Gensim.
        self.dictionary = corpora.Dictionary(line.lower().split() for line in open(self.file_name))

        # Collecting the id's of the tokens which exist in the stop-list
        stop_ids = [self.dictionary.token2id[stop_word] for stop_word in stop_list if
                    stop_word in self.dictionary.token2id]

        # Collecting the id's of the token which appear only once
        once_ids = [token_id for token_id, doc_freq in iteritems(self.dictionary.dfs) if doc_freq == 1]

        # Removing the unwanted tokens using collected id's
        self.dictionary.filter_tokens(stop_ids + once_ids)

        # Saving dictionary in the disk for later use:
        self.dictionary.save('dictionary.dict')

my_memory_friendly_corpus = MyCorpus()

# Saving the corpus
# corpora.MmCorpus.serialize('corpus.mm', my_memory_friendly_corpus)

# To load the saved corpus:
# corpus = corpora.MmCorpus('corpus.mm')

print('\t:::The dictionary::::')
pprint(my_memory_friendly_corpus.dictionary.token2id)
print(my_memory_friendly_corpus)
print('\n\t:::The corpus::::')
for vector in my_memory_friendly_corpus:
    print(vector)

Output (without the logging messages):

    :::The dictionary::::
{'computer': 2,
 'eps': 8,
 'graph': 10,
 'human': 0,
 'interface': 1,
 'minors': 11,
 'response': 6,
 'survey': 3,
 'system': 5,
 'time': 7,
 'trees': 9,
 'user': 4}
<__main__.MyCorpus object at 0x7fe0e9ac5c18>

    :::The corpus::::
[(0, 1), (1, 1), (2, 1)]
[(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(1, 1), (4, 1), (5, 1), (8, 1)]
[(0, 1), (5, 2), (8, 1)]
[(4, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(3, 1), (10, 1), (11, 1)]

Since I was new to both Gensim and Python myself, I faced a similar problem. this mailing-list was very helpful for learning Gensim.