Question

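Gensim uses text streaming to minimize memory requirements. This comes at the cost of performance because of the endless disk IO. Is there a trick to copy the complete file from disk (one disk IO) into a temporary in-memory file on the fly? I would like to keep the code as is (no recoding into list structures), but this is not a good way to debug functionality.

Expected result: much faster code

More background on the question

The original code is at https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb. The example code is taken from the phrase modelling section.

I am calculating the unigrams. All reviews are located at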

review_txt_filepath = os.path.join(intermediate_directory, 'review_text_all.txt')

All unigrams should go to

unigram_sentences_filepath = os.path.join(intermediate_directory, 'unigram_sentences_all.txt')

The key routines are

def punct_space(token):
    return token.is_punct or token.is_space

def line_review(filename):
    # generator function to read in reviews from the file
    with codecs.open(filename, encoding='utf_8') as f:
        for review in f:
            yield review.replace('\\n', '\n')

def lemmatized_sentence_corpus(filename):
    # generator function to use spaCy to parse reviews, lemmatize the text, and yield sentences

    for parsed_review in nlp.pipe(line_review(filename),
                              batch_size=10000, n_threads=4):
        for sent in parsed_review.sents:
            yield u' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])

The unigrams are calculated with

with codecs.open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
    for sentence in lemmatized_sentence_corpus(review_txt_filepath):
        f.write(sentence + '\n')

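Doing this for 5000 lines takes some patience: 1h30m ;-)

I am not that familiar with iterables, but do I understand correctly that I first have to read the actual file (on disk) into a variable "list_of_data" and process that?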

with codecs.open(review_txt_filepath, 'r', encoding='utf_8') as f:
    list_of_data = f.read()

with codecs.open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
    for sentence in lemmatized_sentence_corpus(list_of_data):
        f.write(sentence + '\n')

So the strategy is

1. read all data into a list in memory
2. process the data
3. write the results to disk
4. delete the list from memory by setting list_of_data = ()

One problem with this is obviously that line_review is the one doing the file reading.

Answer 1

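Most gensim interfaces actually take iterable sequences. Examples that emphasize streaming from disk just happen to use iterables that read each item as needed, but you could use an in-memory list instead.

Essentially, if you do have enough RAM to hold your whole dataset in memory, just use the IO-reading iterable to read everything once into a list. Then feed that list to the gensim class wherever it expects an iterable sequence.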
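As a rough sketch of that idea, using only the standard library: `stream_lines` mimics the question's `line_review` generator, while `load_into_memory`, `count_tokens` and the temporary demo file are hypothetical names made up for illustration. `count_tokens` merely stands in for whatever consumer accepts an iterable of strings; it is not a gensim API.

```python
import codecs
import os
import tempfile


def stream_lines(filename):
    # Generator in the style of the question's line_review: reads lazily from disk.
    with codecs.open(filename, encoding='utf_8') as f:
        for line in f:
            yield line.rstrip('\n')


def load_into_memory(iterable):
    # Consume the streaming iterable once and keep every item in a list.
    return list(iterable)


def count_tokens(sentences):
    # Hypothetical stand-in for a consumer that accepts any iterable of strings.
    return sum(len(sentence.split()) for sentence in sentences)


# Small temporary file standing in for the real corpus (assumption for the demo).
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w', encoding='utf_8') as f:
    f.write('the food was great\nservice was slow\n')

sentences = load_into_memory(stream_lines(demo_path))  # single pass over the file
os.remove(demo_path)  # the data now lives in memory, so the file is not needed

first_pass = count_tokens(sentences)
second_pass = count_tokens(sentences)  # iterating the list again causes no disk IO
```

Unlike a generator, which is exhausted after one pass, the list can be iterated as often as the consumer needs.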

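This shouldn't involve any "recoding into a list structure", but it does use a Python list type to hold things in memory. It's the most natural way to do it, and likely the most efficient, especially for algorithms that make multiple passes over tokenized text.

(The less idiomatic approach of, say, loading the entire file into a raw byte array and then performing repeated file-style reads of it for the individual items the algorithm needs is clunkier. It may similarly save on repeated IO cost, but it will likely waste effort on re-parsing/tokenizing items that are processed repeatedly. If you have the memory, you'll want to keep each item as a Python object in memory, and that requires putting them in a list.)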
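To illustrate the trade-off in that parenthetical, here is a small sketch of the "in-memory file" variant using `io.StringIO`. The sample text, the whitespace tokenizer and the name `tokenized_lines` are assumptions for the demo only; in the question the expensive step would be the spaCy parse rather than `str.split`.

```python
import io

# Stand-in for the result of a single f.read() of the corpus file.
raw_text = 'the food was great\nservice was slow\n'


def tokenized_lines(file_like):
    # Rewind the in-memory file and tokenize every line again on each call.
    file_like.seek(0)
    for line in file_like:
        yield line.split()


in_memory_file = io.StringIO(raw_text)

# Both passes avoid disk IO, but each one repeats the tokenizing work.
pass_one = list(tokenized_lines(in_memory_file))
pass_two = list(tokenized_lines(in_memory_file))

# Keeping the already-tokenized items in a list avoids that repeated work.
parsed_once = pass_one
```

The in-memory file removes the disk reads but not the per-pass parsing, which is why holding the parsed items in a list is usually the better fit when memory allows.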

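To answer more specifically, you'd need to provide more details in the question, such as which specific algorithms/corpus-reading styles you're using, ideally with example code.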
