I am using Python 3.5 and, following the gensim samples, I created a project and added this code to it:
class MyCorpus(object):
    def __iter__(self):
        for line in open('files/2/mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
print(corpus_memory_friendly)
But after running it, I got this error in the PyCharm console:
Traceback (most recent call last):
  File "D:/Python-Workspace(s)/GensimSamples/2.Gensim_CorpusStreaming.py", line 31, in <module>
    for vector in corpus_memory_friendly:  # load one vector into memory at a time
  File "D:/Python-Workspace(s)/GensimSamples/2.Gensim_CorpusStreaming.py", line 17, in __iter__
    yield dictionary.doc2bow(line.lower().split())
AttributeError: module 'gensim.corpora.dictionary' has no attribute 'doc2bow'
How can I solve this problem?
Answer 0 (score: 0)
The AttributeError occurs because the name dictionary in your snippet is bound to the module gensim.corpora.dictionary rather than to a corpora.Dictionary instance, so it has no doc2bow method. We just need to prepare the dictionary beforehand and make it available to the MyCorpus class.
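As a minimal sketch of that fix (reusing the file path files/2/mycorpus.txt from your snippet), you could build the dictionary once up front and then stream the corpus against it:

from gensim import corpora

# Build the dictionary once, up front, from the same text file.
dictionary = corpora.Dictionary(
    line.lower().split() for line in open('files/2/mycorpus.txt'))

class MyCorpus(object):
    def __iter__(self):
        for line in open('files/2/mycorpus.txt'):
            # one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
print(corpus_memory_friendly)

A more complete example class that creates a memory-friendly corpus could be: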
import logging
from pprint import pprint
from six import iteritems
from gensim import corpora

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


class MyCorpus(object):
    def __init__(self, text_file='text_corpus.txt', dictionary=None):
        """
        Checks if a dictionary has been given as a parameter.
        If no dictionary has been given, it creates one and saves it to disk.
        """
        self.file_name = text_file
        if dictionary is None:
            self.prepare_dictionary()
        else:
            self.dictionary = dictionary

    def __iter__(self):
        for line in open(self.file_name):
            # assume there's one document per line, tokens separated by whitespace
            yield self.dictionary.doc2bow(line.lower().split())

    def prepare_dictionary(self):
        stop_list = set('for a of the and to in'.split())  # List of stop words, which could also be loaded from a file.

        # Create a dictionary from the stored text file using the Dictionary class provided by Gensim.
        self.dictionary = corpora.Dictionary(line.lower().split() for line in open(self.file_name))

        # Collect the ids of the tokens which appear in the stop-list.
        stop_ids = [self.dictionary.token2id[stop_word] for stop_word in stop_list
                    if stop_word in self.dictionary.token2id]

        # Collect the ids of the tokens which appear only once.
        once_ids = [token_id for token_id, doc_freq in iteritems(self.dictionary.dfs) if doc_freq == 1]

        # Remove the unwanted tokens using the collected ids.
        self.dictionary.filter_tokens(stop_ids + once_ids)

        # Save the dictionary to disk for later use.
        self.dictionary.save('dictionary.dict')


my_memory_friendly_corpus = MyCorpus()

# Saving the corpus:
# corpora.MmCorpus.serialize('corpus.mm', my_memory_friendly_corpus)
# To load the saved corpus:
# corpus = corpora.MmCorpus('corpus.mm')

print('\t:::The dictionary::::')
pprint(my_memory_friendly_corpus.dictionary.token2id)
print(my_memory_friendly_corpus)

print('\n\t:::The corpus::::')
for vector in my_memory_friendly_corpus:
    print(vector)
Output (without the logging info):
:::The dictionary::::
{'computer': 2,
'eps': 8,
'graph': 10,
'human': 0,
'interface': 1,
'minors': 11,
'response': 6,
'survey': 3,
'system': 5,
'time': 7,
'trees': 9,
'user': 4}
<__main__.MyCorpus object at 0x7fe0e9ac5c18>
:::The corpus::::
[(0, 1), (1, 1), (2, 1)]
[(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(1, 1), (4, 1), (5, 1), (8, 1)]
[(0, 1), (5, 2), (8, 1)]
[(4, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(3, 1), (10, 1), (11, 1)]
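Since prepare_dictionary() saves the dictionary to dictionary.dict, on later runs you can load it and pass it in so the class skips rebuilding it. A small sketch, assuming the same text_corpus.txt file is still present:

from gensim import corpora

# Load the dictionary that prepare_dictionary() saved earlier and reuse it.
loaded_dictionary = corpora.Dictionary.load('dictionary.dict')
corpus = MyCorpus(text_file='text_corpus.txt', dictionary=loaded_dictionary)

for vector in corpus:
    print(vector)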
Being new to both Gensim and Python, I faced a similar problem myself. this mailing-list was very helpful for learning Gensim.