Memory usage for word2vec

Asked: 2016-09-02 21:54:52

Tags: python numpy word2vec

I'm generating word2vec models on a cluster with gensim, using sentences from medical journals stored in JSON files, and my memory usage is getting out of hand.

The task is to keep a cumulative list of all sentences up to a particular year and then generate a word2vec model for that year. Then the next year's sentences are added to the cumulative list, and another model is generated and saved for that year based on all the sentences.

I/O on this particular cluster is slow enough and the data is large enough (reading 2/3 of it into memory takes about 3 days) that streaming each year's JSON from disk for every model would take far too long, so the solution is to load all 90GB of JSON into memory in a python list. I have permission to use up to 256GB of memory for this, but could get more if necessary.
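For reference, by "streaming" I mean something like the usual gensim pattern sketched below (the class name is just illustrative, reusing the json_dir/json_list variables from my code further down); Word2Vec walks the corpus once to build the vocabulary and again for each training epoch, so every pass pays the full disk cost:

import ujson as json

class JsonSentenceStream(object):
    """Illustrative streaming corpus: re-reads the JSON files on every pass."""
    def __init__(self, json_dir, json_files):
        self.json_dir = json_dir
        self.json_files = json_files

    def __iter__(self):
        for json_file in self.json_files:
            with open(self.json_dir + json_file, 'rb') as f:
                for sentence in json.load(f)['sentences']:
                    yield sentence

# e.g. word2vec.Word2Vec(JsonSentenceStream(json_dir, json_list), ...)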

The trouble I'm running into is that I'm running out of memory. I've read some other posts about the way Python keeps free lists rather than returning memory to the OS, and I think that may be part of the problem, but I'm not sure.

Thinking that the free lists might be the problem, and that numpy might have a better implementation for a large number of elements, I switched from a cumulative list of sentences to a cumulative array of sentences (gensim requires that sentences be lists of words/strings). But I ran it on a small subset of the sentences and it used more memory, so I'm not sure how to proceed.
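A toy check (with made-up sentences) of why the numpy switch doesn't seem to buy anything: ragged lists of strings can only be stored as an object array, so numpy just holds pointers while every list and word string remains a separate Python object on the heap.

import numpy as np

# gensim-style sentences are ragged lists of word strings
sentences = [['myocardial', 'infarction'], ['patient', 'outcomes', 'improved']]

arr = np.array(sentences, dtype=object)
print(arr.dtype)               # object -- the array only stores pointers
print(arr[0] is sentences[0])  # True -- the original lists/strings are all still there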

If anyone has experience with this, I'd love to have your help. Also, if there's anything else you think should be changed, I'd really appreciate you telling me. The full code is below:

import ujson as json
import os
import sys
import logging
from gensim.models import word2vec
import numpy as np

PARAMETERS = {
    'numfeatures': 250,
    'minwordcount': 10,
    'context': 7,
    'downsampling': 0.0001,
    'workers': 32
}

logger = logging.getLogger()
handler = logging.StreamHandler()
formatter = logging.Formatter('%(asctime)s %(levelname)-8s %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def generate_and_save_model(cumulative_sentences, models_dir, year):
    """
    Generates and saves the word2vec model for the given year
    :param cumulative_sentences: The list of all sentences up to the current year
    :param models_dir: The directory to save the models to
    :param year: The current year of interest
    :return: Nothing, only saves the model to disk
    """
    cumulative_model = word2vec.Word2Vec(
        sentences=cumulative_sentences,
        workers=PARAMETERS['workers'],
        size=PARAMETERS['numfeatures'],
        min_count=PARAMETERS['minwordcount'],
        window=PARAMETERS['context'],
        sample=PARAMETERS['downsampling']
    )
    cumulative_model.init_sims(replace=True)
    cumulative_model.save(models_dir + 'medline_abstract_word2vec_' + year)


def save_year_models(json_list, json_dir, models_dir, min_year, max_year):
    """
    :param json_list: The list of json year_sentences file names
    :param json_dir: The directory holding the sentences json files
    :param models_dir: The directory to serialize the models to
    :param min_year: The minimum value of a year to generate a model for
    :param max_year: The maximum value of a year to generate a model for
    Goes year by year through each json of sentences, saving a cumulative word2vec
    model for each year
    """

    cumulative_sentences = np.array([])

    for json_file in json_list:
        year = json_file[16:20]

        # If this year is greater than the maximum, we're done creating models
        if int(year) > max_year:
            break

        with open(json_dir + json_file, 'rb') as current_year_file:
            cumulative_sentences = np.concatenate(
                (np.array(json.load(current_year_file)['sentences']),
                 cumulative_sentences)
            )

        logger.info('COMPLETE: ' + year + ' sentences loaded')
        logger.info('Cumulative length: ' + str(len(cumulative_sentences)) + ' sentences loaded')
        sys.stdout.flush()

        # If this year is less than our minimum, add its sentences to the list and continue
        if int(year) < min_year:
            continue

        generate_and_save_model(cumulative_sentences, models_dir, year)

        logger.info('COMPLETE: ' + year + ' model saved')
        sys.stdout.flush()


def main():
    json_dir = '/projects/chemotext/sentences_by_year/'
    models_dir = '/projects/chemotext/medline_year_models/'

    # By default, generate models for all years we have sentences for
    minimum_year = 0
    maximum_year = 9999

    # If one command line argument is used
    if len(sys.argv) == 2:
        # Generate the model for only that year
        minimum_year = int(sys.argv[1])
        maximum_year = int(sys.argv[1])

    # If two CL arguments are used
    if len(sys.argv) == 3:
        # Generate all models between the two year arguments, inclusive
        minimum_year = int(sys.argv[1])
        maximum_year = int(sys.argv[2])

    # Sorting the list of files so that earlier years are first in the list
    json_list = sorted(os.listdir(json_dir))

    save_year_models(json_list, json_dir, models_dir, minimum_year, maximum_year)

if __name__ == '__main__':
    main()

1 Answer:

Answer 0 (score 0):

I think you should be able to significantly reduce the memory footprint of the corpus by explicitly storing only the first occurrence of each word. Every occurrence after that only needs to store a reference to the first one. That way you don't spend memory on repeated strings, at the cost of some overhead. In code it would look like this:

class WordCache(object):
    def __init__(self):
        self.cache = {}

    def filter(self, words):
        # Replace each word with the first equal string seen so far, so that
        # duplicate words across sentences all reference a single object
        for i, word in enumerate(words):
            try:
                words[i] = self.cache[word]
            except KeyError:
                self.cache[word] = word
        return words

cache = WordCache()
...
for sentence in json.load(current_year_file)['sentences']:
    cumulative_sentences.append(cache.filter(sentence))
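A quick sanity check (with made-up words) that repeated words end up sharing a single string object after filtering:

cache = WordCache()
first = cache.filter('myocardial infarction risk'.split())
second = cache.filter('reduced infarction rates'.split())
print(second[1] is first[1])  # True: 'infarction' is stored only once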

Another thing you might try is moving to Python 3.3 or above. It has a more memory-efficient representation of Unicode strings, see PEP 393.
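On Python 3 you could also get much the same deduplication from the built-in sys.intern instead of a hand-rolled cache; a minimal sketch (assuming Python 3, where the decoded JSON words are str):

from sys import intern

def intern_words(words):
    # Each distinct word is stored once; repeats reuse the same interned object
    return [intern(word) for word in words]

for sentence in json.load(current_year_file)['sentences']:
    cumulative_sentences.append(intern_words(sentence))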