Constructing a gensim Dictionary without loading all text into memory

Date: 2018-08-24 22:49:37

Tags: python bigdata gensim multiple-files

Rather than constructing the dictionary from a single document ('mycorpus.txt'), I want to build it from multiple documents (10,000 files, each about 25 MB). Note that I am trying to follow the gensim approach of "constructing the dictionary without loading all texts into memory".

>>> from gensim import corpora
>>> from six import iteritems
>>> stoplist = set('for a of the and to in'.split())  # stoplist was undefined here; a small example list
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
>>> stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
>>>             if stopword in dictionary.token2id]
>>> once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
>>> dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once
>>> dictionary.compactify()  # remove gaps in id sequence after words that were removed
>>> print(dictionary)

1 answer:

Answer 0: (score: 1)

For this you need an iterator.
Taken from the gensim website:

import os

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.lower().split()

sentences = MySentences('/some/directory')  # a memory-friendly iterator

sentences is an iterator: it opens each file only when needed, consumes it, and then releases it. At any given moment, therefore, at most one file is held in memory.
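The same streaming behavior can also be written as a generator function instead of a class (a sketch; note that a generator can only be consumed once, whereas the MySentences class above can be re-iterated because __iter__ builds a fresh generator each time):

```python
import os

def iter_sentences(dirname):
    """Generator equivalent of MySentences: yields one tokenized
    line at a time, keeping at most one file open."""
    # sorted() makes the iteration order deterministic across platforms.
    for fname in sorted(os.listdir(dirname)):
        with open(os.path.join(dirname, fname)) as fh:
            for line in fh:
                yield line.lower().split()
```

Since corpora.Dictionary only needs a single pass over the corpus, a one-shot generator like this is sufficient; re-iterable classes matter more for multi-pass consumers such as Word2Vec training.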

From the website:

If our input is spread across several files on disk, with one sentence per line, then rather than loading everything into an in-memory list, we can process the input file by file, line by line.

To use it in your case, simply replace the dictionary line with:

dictionary = corpora.Dictionary(line for line in sentences)

where sentences is the variable defined earlier, given the path to the folder containing your multiple .txt files.

To learn more about iterators, iterables, and generators, check out this blog.