Python: spaCy and memory consumption

Date: 2019-04-25 02:45:31

Tags: python-3.x spacy

1 - Problem

I'm using spaCy in Python to lemmatize text documents. There are 500,000 documents, with raw text sizes of up to 20 MB each.

The problem is the following: spaCy's memory consumption grows over time until all available memory is used up.

2 - Background

My hardware configuration: CPU: Intel i7-8700K 3.7 GHz (12 cores); RAM: 16 GB; SSD: 1 TB; onboard GPU, but it is not used for this task.

I'm using multiprocessing to split the task into several processes (workers). Each worker receives a list of documents to process. The main process monitors the child processes. I initialize spaCy once in each child process and use that single spaCy instance for the worker's entire list of documents.

The memory trace shows the following:

[ Memory trace - Top 10 ]

/opt/develop/virtualenv/lib/python3.6/site-packages/thinc/neural/mem.py:68: size=45.1 MiB, count=99, average=467 KiB
/opt/develop/virtualenv/lib/python3.6/posixpath.py:149: size=40.3 MiB, count=694225, average=61 B
:487: size=9550 KiB, count=77746, average=126 B
/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:33: size=7901 KiB, count=6, average=1317 KiB
/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_nouns.py:7114: size=5273 KiB, count=57494, average=94 B
prepare_docs04.py:372: size=4189 KiB, count=1, average=4189 KiB
/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:93: size=3949 KiB, count=5, average=790 KiB
/usr/lib/python3.6/json/decoder.py:355: size=1837 KiB, count=20456, average=92 B
/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_adjectives.py:2828: size=1704 KiB, count=20976, average=83 B
prepare_docs04.py:373: size=1633 KiB, count=1, average=1633 KiB

3 - Expectations

I saw a very good suggestion here to build a separate server-client solution: Is possible to keep spacy in memory to reduce the load time?

Is it possible to keep memory consumption under control while using the multiprocessing approach?

4 - Code

Here is a simplified version of my code:

import gc, os, subprocess, spacy, sys, tracemalloc
from multiprocessing import Pipe, Process, Lock
from time import sleep

# START: memory trace
tracemalloc.start()

# Load spacy
spacyMorph = spacy.load("en_core_web_sm")

#
# Get word's lemma
#
def getLemma(word):
    global spacyMorph
    lemmaOutput = spacyMorph(str(word))
    return lemmaOutput


#
# Worker's logic
#
def workerNormalize(lock, conn, params):
    documentCount = 1
    for filenameRaw in params[1]:
        documentTotal = len(params[1])
        documentID = int(os.path.basename(filenameRaw).split('.')[0])

        # Send the worker's current progress to the main process
        statusMessage = "WORKING:{:d},{:d},".format(documentID, documentCount)
        documentCount += 1
        if lock is not None:
            lock.acquire()
            try:
                conn.send(statusMessage)
            finally:
                lock.release()
        else:
            print(statusMessage)

        # ----------------
        # Some code is excluded for clarity sake
        # I've got a "wordList" from file "filenameRaw"
        # ----------------

        wordCount = 1
        wordTotalCount = len(wordList)

        for word in wordList:
            lemma = getLemma(word)
            wordCount += 1

        # ----------------
        # Then I collect all lemmas and save it to another text file
        # ----------------

        # Here I'm trying to reduce memory usage
        del wordList
        del word
        gc.collect()


if __name__ == '__main__':
    lock = Lock()
    processList = []

    # ----------------
    # Some code is excluded for clarity sake
    # Here I'm getting the full list of files "fileTotalList" which I need to lemmatize
    # (cursorStart, cursorEnd, stepSize and count are also initialized in this excluded code)
    # ----------------
    while cursorEnd < (docTotalCount + stepSize):
        fileList = fileTotalList[cursorStart:cursorEnd]

        # ----------------
        # Create workers and populate it with list of files to process
        # ----------------
        processData = {}
        processData['total'] = len(fileList)  # worker total progress
        processData['count'] = 0  # worker documents done count
        processData['currentDocID'] = 0  # current document ID the worker is working on
        processData['comment'] = ''  # additional comment (optional)
        processData['con_parent'], processData['con_child'] = Pipe(duplex=False)
        processName = 'worker ' + str(count) + " at " + str(cursorStart)
        processData['handler'] = Process(target=workerNormalize, name=processName, args=(lock, processData['con_child'], [processName, fileList]))

        processList.append(processData)
        processData['handler'].start()

        cursorStart = cursorEnd
        cursorEnd += stepSize
        count += 1

    # ----------------
    # Run the monitor to look after the workers
    # ----------------
    while True:
        runningCount = 0

        #Worker communication format:
        #STATUS:COMMENTS

        #STATUS:
        #- WORKING - worker is working
        #- CLOSED - worker has finished his job and closed pipe-connection

        #COMMENTS:
        #- for WORKING status:
        #DOCID,COUNT,COMMENTS
        #DOCID - current document ID the worker is working on
        #COUNT - count of done documents
        #COMMENTS - additional comments (optional)


        # ----------------
        # Run through the list of workers ...
        # ----------------
        for i, process in enumerate(processList):
            if process['handler'].is_alive():
                runningCount += 1

                # ----------------
                # .. and check if there is something in the PIPE
                # ----------------
                if process['con_parent'].poll():
                    try:
                        message = process['con_parent'].recv()
                        status = message.split(':')[0]
                        comment = message.split(':')[1]

                        # ----------------
                        # Some code is excluded for clarity sake
                        # Update worker's information and progress in "processList"
                        # ----------------

                    except EOFError:
                        print("EOF----")

                # ----------------
                # Some code is excluded for clarity sake
                # Here I draw some progress lines per workers
                # ----------------

            else:
                # worker has finished his job. Close the connection.
                process['con_parent'].close()

        # Exit the monitor loop once all workers have finished
        if runningCount == 0:
            break

        # Wait for some time and monitor again
        sleep(PARAM['MONITOR_REFRESH_FREQUENCY'])


    print("================")
    print("**** DONE ! ****")
    print("================")

    # ----------------
    # Here I'm measuring memory usage to find the most "gluttonous" part of the code
    # ----------------
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')

    print("[ Memory trace - Top 10 ]")
    for stat in top_stats[:10]:
        print(stat)



2 Answers:

Answer 0 (score: 1)

Memory leaks with spacy

Memory problems when processing large amounts of data seem to be a known issue; see the related spaCy GitHub issues on memory growth during processing.

Unfortunately, it doesn't look like there is a good solution yet.

Lemmatization

Looking at your particular lemmatization task, I think your example code is a bit over-simplified, because you run the full spaCy pipeline on single words and then do nothing with the result (not even inspect the lemma?), so it's hard to tell what you actually want to do.

I'll assume you just want to lemmatize, so in general you'd want to disable the parts of the pipeline you're not using as much as possible (especially parsing if you're only lemmatizing; see the spaCy documentation on disabling pipeline components) and use nlp.pipe to process documents in batches. spaCy can't handle very long documents if you're using the parser or entity recognizer, so you'll need to break your texts up somehow (or, for just lemmatization/tagging, you can increase nlp.max_length as much as you need).

Breaking documents up into individual words, as in your example, defeats the purpose of most of spaCy's analysis (you often can't meaningfully tag or parse single words), and calling spaCy this way is also going to be very slow.

Lookup lemmatization

If you just need lemmas for common words out of context (where the tagger isn't going to provide any useful information), you can check whether the lookup lemmatizer is good enough for your task and skip the rest of the processing:

from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LOOKUP
lemmatizer = Lemmatizer(lookup=LOOKUP)
print(lemmatizer(u"ducks", ''), lemmatizer(u"ducking", ''))

Output:

['duck'] ['duck']

This is just a static lookup table, so it won't do well with unknown words or with capitalization of words like "wugs" or "DUCKS", so you'll have to see whether it works well enough for your texts, but it would be much faster without a memory leak. (You can also use the table on your own without anything spaCy-related; it's at https://spacy.io/usage/processing-pipelines#disabling.)
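
For example, here is a minimal sketch of using the table on its own, assuming the same spaCy 2.x LOOKUP import shown in the snippet above (the table is a plain dict, so nothing else from spaCy is involved); the helper name is made up for illustration:

from spacy.lang.en import LOOKUP

def lookup_lemma(word):
    # Unknown or capitalized words (e.g. "wugs", "DUCKS") simply fall back to themselves
    return LOOKUP.get(word, word)

print(lookup_lemma("ducks"), lookup_lemma("wugs"))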

Better lemmatization

Otherwise, use something like the following to process texts in batches:

import spacy

nlp = spacy.load('en', disable=['parser', 'ner'])
# if needed: nlp.max_length = MAX_DOC_LEN_IN_CHAR
for doc in nlp.pipe(texts):
    for token in doc:
        print(token.lemma_)

If you process one long text (or use nlp.pipe() for lots of shorter texts) rather than processing individual words, you should be able to tag/lemmatize (many) thousands of words per second in a single thread.
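
As a rough sketch of how this might map onto the question's per-file worker loop, assuming each document is read as a single text string instead of being split into words first; read_document and save_lemmas are hypothetical placeholders for the file handling the question's code excludes:

import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
nlp.max_length = 2000000  # raise if documents exceed the default character limit

def lemmatize_files(filenames):
    # Stream whole documents through the pipeline in batches instead of word by word
    texts = (read_document(f) for f in filenames)
    for filename, doc in zip(filenames, nlp.pipe(texts, batch_size=20)):
        save_lemmas(filename, [token.lemma_ for token in doc])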

Answer 1 (score: 0)

For anyone landing on this in the future, I found a hack that seems to work well:

import spacy
import en_core_web_lg
import multiprocessing

docs = ['Your documents']

def process_docs(docs, n_processes=None):
    # Load the model inside the subprocess, 
    # as that seems to be the main culprit of the memory issues
    nlp = en_core_web_lg.load()

    if not n_processes:
        n_processes = multiprocessing.cpu_count()

    processed_docs = [doc for doc in nlp.pipe(docs, disable=['ner', 'parser'], n_process=n_processes)]


    # Then do what you wish beyond this point. I end up writing results out to s3.
    pass

for x in range(10):
    # This will spin up a subprocess, 
    # and everytime it finishes it will release all resources back to the machine.
    with multiprocessing.Manager() as manager:
        p = multiprocessing.Process(target=process_docs, args=(docs,))
        p.start()
        p.join()

The idea here is to put everything spaCy-related into a subprocess so that all the memory is released once the subprocess finishes. I know it's working because I can actually watch the memory being released back to the instance every time the subprocess finishes (and the instance no longer crashes xD).
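
A related variant of the same idea (not from this answer, just a sketch under the same assumptions): multiprocessing.Pool can recycle workers for you via maxtasksperchild, so each worker process is replaced after a fixed number of documents and its memory is handed back to the OS. The model name and the counts below are placeholders:

import multiprocessing
import spacy

_nlp = None

def init_worker():
    # Load the model once per worker process
    global _nlp
    _nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize_one(text):
    return [token.lemma_ for token in _nlp(text)]

if __name__ == "__main__":
    docs = ["Your documents"]
    # maxtasksperchild=50 recycles each worker after 50 documents,
    # so whatever memory it has accumulated is returned to the OS.
    with multiprocessing.Pool(processes=4, initializer=init_worker,
                              maxtasksperchild=50) as pool:
        all_lemmas = pool.map(lemmatize_one, docs)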

Full disclosure: I have no idea why spaCy seems to hold on to memory over time. I've read everything I could find trying to get a simple answer, and all the GitHub issues I've seen claim the problem has been fixed, yet I still saw it happening when I used spaCy on an AWS SageMaker instance.

Hope this helps someone! I know I spent hours struggling with this.

Credit to another SO answer that explains subprocesses in Python in more detail.