I'm lemmatizing text documents with spaCy in Python. There are 500,000 documents, the largest of which is up to 20 MB of raw text.
The problem is this: spaCy's memory consumption grows over time until all of the memory is used up.
My hardware configuration: CPU: Intel i7-8700K 3.7 GHz (12 cores); memory: 16 GB; SSD: 1 TB; on-board GPU, not used for this task.
I'm using multiprocessing to divide the task across several processes (workers). Each worker receives a list of documents to process. The main process monitors the child processes. I load spaCy once in each child process and use that single spaCy instance for the worker's entire list of documents.
Memory tracing shows the following:
[ Memory trace - Top 10 ]
/opt/develop/virtualenv/lib/python3.6/site-packages/thinc/neural/mem.py:68: size=45.1 MiB, count=99, average=467 KiB
/opt/develop/virtualenv/lib/python3.6/posixpath.py:149: size=40.3 MiB, count=694225, average=61 B
:487: size=9550 KiB, count=77746, average=126 B
/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:33: size=7901 KiB, count=6, average=1317 KiB
/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_nouns.py:7114: size=5273 KiB, count=57494, average=94 B
prepare_docs04.py:372: size=4189 KiB, count=1, average=4189 KiB
/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:93: size=3949 KiB, count=5, average=790 KiB
/usr/lib/python3.6/json/decoder.py:355: size=1837 KiB, count=20456, average=92 B
/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_adjectives.py:2828: size=1704 KiB, count=20976, average=83 B
prepare_docs04.py:373: size=1633 KiB, count=1, average=1633 KiB
I saw a good suggestion to build a standalone server-client solution here: Is possible to keep spacy in memory to reduce the load time?
Is there a way to keep memory consumption under control with the multiprocessing approach?
Here is a simplified version of my code:
import gc, os, subprocess, spacy, sys, tracemalloc
from multiprocessing import Pipe, Process, Lock
from time import sleep
# START: memory trace
tracemalloc.start()
# Load spacy
spacyMorph = spacy.load("en_core_web_sm")
#
# Get word's lemma
#
def getLemma(word):
    global spacyMorph
    lemmaOutput = spacyMorph(str(word))
    return lemmaOutput
#
# Worker's logic
#
def workerNormalize(lock, conn, params):
    documentCount = 1
    for filenameRaw in params[1]:
        documentTotal = len(params[1])
        documentID = int(os.path.basename(filenameRaw).split('.')[0])

        # Send the worker's current progress to the main process
        statusMessage = "WORKING:{:d},{:d},".format(documentID, documentCount)
        documentCount += 1
        if lock is not None:
            lock.acquire()
            try:
                conn.send(statusMessage)
            finally:
                lock.release()
        else:
            print(statusMessage)

        # ----------------
        # Some code is excluded for clarity's sake
        # I've got a "wordList" from file "filenameRaw"
        # ----------------

        wordCount = 1
        wordTotalCount = len(wordList)
        for word in wordList:
            lemma = getLemma(word)
            wordCount += 1

        # ----------------
        # Then I collect all lemmas and save them to another text file
        # ----------------

        # Here I'm trying to reduce memory usage
        del wordList
        del word
        gc.collect()
if __name__ == '__main__':
    lock = Lock()
    processList = []

    # ----------------
    # Some code is excluded for clarity's sake
    # Here I'm getting the full list of files "fileTotalList" which I need to lemmatize
    # (the excluded code also initializes cursorStart, cursorEnd, stepSize, docTotalCount and count)
    # ----------------

    while cursorEnd < (docTotalCount + stepSize):
        fileList = fileTotalList[cursorStart:cursorEnd]

        # ----------------
        # Create workers and populate them with the list of files to process
        # ----------------
        processData = {}
        processData['total'] = len(fileList)  # worker's total progress
        processData['count'] = 0              # count of documents the worker has done
        processData['currentDocID'] = 0       # current document ID the worker is working on
        processData['comment'] = ''           # additional comment (optional)
        processData['con_parent'], processData['con_child'] = Pipe(duplex=False)
        processName = 'worker ' + str(count) + " at " + str(cursorStart)
        processData['handler'] = Process(target=workerNormalize, name=processName,
                                         args=(lock, processData['con_child'], [processName, fileList]))

        processList.append(processData)
        processData['handler'].start()

        cursorStart = cursorEnd
        cursorEnd += stepSize
        count += 1

    # ----------------
    # Run the monitor to look after the workers
    # ----------------
    while True:
        runningCount = 0

        # Worker communication format:
        # STATUS:COMMENTS
        # STATUS:
        # - WORKING - worker is working
        # - CLOSED  - worker has finished its job and closed the pipe connection
        # COMMENTS, for WORKING status:
        # DOCID,COUNT,COMMENTS
        # DOCID    - current document ID the worker is working on
        # COUNT    - count of done documents
        # COMMENTS - additional comments (optional)

        # ----------------
        # Run through the list of workers ...
        # ----------------
        for i, process in enumerate(processList):
            if process['handler'].is_alive():
                runningCount += 1

                # ----------------
                # ... and check if there is something in the PIPE
                # ----------------
                if process['con_parent'].poll():
                    try:
                        message = process['con_parent'].recv()
                        status = message.split(':')[0]
                        comment = message.split(':')[1]

                        # ----------------
                        # Some code is excluded for clarity's sake
                        # Update the worker's information and progress in "processList"
                        # ----------------
                    except EOFError:
                        print("EOF----")

                # ----------------
                # Some code is excluded for clarity's sake
                # Here I draw some progress lines per worker
                # ----------------
            else:
                # The worker has finished its job. Close the connection.
                process['con_parent'].close()

        # All workers have finished: stop monitoring
        if runningCount == 0:
            break

        # Wait for some time and monitor again
        sleep(PARAM['MONITOR_REFRESH_FREQUENCY'])

    print("================")
    print("**** DONE ! ****")
    print("================")

    # ----------------
    # Here I'm measuring memory usage to find the most "gluttonous" part of the code
    # ----------------
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')

    print("[ Memory trace - Top 10 ]")
    for stat in top_stats[:10]:
        print(stat)
Answer 0 (score: 1)
Memory problems when processing large amounts of data seem to be a known issue; there are several related GitHub issues about it.
Unfortunately, it doesn't look like there's a good solution yet.
Looking at your particular lemmatization task, I think your example code is a bit too simplified, because you're running the full spacy pipeline on individual words and then not doing anything with the results (not even inspecting the lemma?), so it's hard to tell what you actually want to do.
I'll assume you just want lemmas, so in general you want to disable the parts of the pipeline you're not using as much as possible (especially parsing if you're only lemmatizing; see https://spacy.io/usage/processing-pipelines#disabling) and use nlp.pipe to process documents in batches. Spacy can't handle very long documents if you're using the parser or entity recognizer, so you'll need to break up your texts somehow (or, for just lemmatization/tagging, you can increase nlp.max_length as much as you need).
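For example, a minimal sketch of that setup for a single long document, assuming en_core_web_sm as in the question (the file name and the 2,000,000-character limit are only illustrative):

import spacy

# Disable the components we don't need, so only tagging/lemmatization runs.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# Without the parser/NER it is safe to raise the document length limit;
# 2,000,000 characters is an arbitrary example value.
nlp.max_length = 2_000_000

with open("some_big_document.txt", encoding="utf8") as f:  # hypothetical input file
    doc = nlp(f.read())

lemmas = [token.lemma_ for token in doc]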
Breaking a document up into individual words, as in your example, defeats the purpose of most of spacy's analysis (you often can't meaningfully tag or parse single words), and calling spacy this way is also going to be very slow.
If you just need lemmas of common words out of context (where the tagger isn't going to provide any useful information), you can see whether the lookup lemmatizer is good enough for your task and skip the rest of the processing:
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LOOKUP
lemmatizer = Lemmatizer(lookup=LOOKUP)
print(lemmatizer(u"ducks", ''), lemmatizer(u"ducking", ''))
Output:
['duck'] ['duck']
This is just a static lookup table, so it won't do well on unknown words or on capitalization like "wugs" or "DUCKS", so you'll have to see whether it works well enough for your texts, but then it would be much, much faster without any memory leaks. (You can also just use the table on its own without spacy; as the import above shows, it's available as spacy.lang.en.LOOKUP.)
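A quick sketch of using the table on its own, assuming LOOKUP behaves like a plain dict mapping word forms to lemmas (as it does in the spaCy 2.x versions this snippet targets):

from spacy.lang.en import LOOKUP

def lookup_lemma(word):
    # Unknown forms (e.g. "wugs") and unexpected capitalization (e.g. "DUCKS")
    # simply come back unchanged.
    return LOOKUP.get(word, word)

print(lookup_lemma("ducks"), lookup_lemma("DUCKS"), lookup_lemma("wugs"))
# duck DUCKS wugs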
Otherwise, use something like the following to process texts in batches:
nlp = spacy.load('en', disable=['parser', 'ner'])
# if needed: nlp.max_length = MAX_DOC_LEN_IN_CHAR
for doc in nlp.pipe(texts):
    for token in doc:
        print(token.lemma_)
If you process one long text (or use nlp.pipe() for lots of shorter texts) instead of processing individual words, you should be able to tag/lemmatize (many) thousands of words per second in one thread.
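As a rough sketch of how the worker from the question could feed whole documents through nlp.pipe instead of calling spacy word by word (the file handling and the ".lemmas.txt" output path are just placeholders):

import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize_files(filenames):
    def read_texts():
        for name in filenames:
            with open(name, encoding="utf8") as f:
                yield f.read()

    # nlp.pipe streams the documents through the pipeline in batches.
    for name, doc in zip(filenames, nlp.pipe(read_texts(), batch_size=20)):
        lemmas = [token.lemma_ for token in doc]
        with open(name + ".lemmas.txt", "w", encoding="utf8") as out:
            out.write(" ".join(lemmas))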
Answer 1 (score: 0)
For people landing on this in the future, I found a hack that seems to work well:
import spacy
import en_core_web_lg
import multiprocessing
docs = ['Your documents']
def process_docs(docs, n_processes=None):
    # Load the model inside the subprocess,
    # as that seems to be the main culprit of the memory issues
    nlp = en_core_web_lg.load()

    if not n_processes:
        n_processes = multiprocessing.cpu_count()

    processed_docs = [doc for doc in nlp.pipe(docs, disable=['ner', 'parser'], n_process=n_processes)]

    # Then do what you wish beyond this point. I end up writing results out to s3.
    pass
for x in range(10):
    # This will spin up a subprocess,
    # and every time it finishes it will release all resources back to the machine.
    with multiprocessing.Manager() as manager:
        p = multiprocessing.Process(target=process_docs, args=(docs,))
        p.start()
        p.join()
The idea here is to put everything Spacy-related into a subprocess, so that all the memory is released once the subprocess finishes. I know it's working because I can actually see the memory being given back to the instance every time a subprocess finishes (and the instance no longer crashes xD).
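For instance, a sketch of what that loop might look like when working through the question's 500,000 files in fixed-size batches (the chunk size and the read_text helper are hypothetical; process_docs is the function from the snippet above):

import multiprocessing

CHUNK_SIZE = 10_000  # hypothetical batch size

def read_text(filename):
    with open(filename, encoding="utf8") as f:
        return f.read()

def run_in_batches(all_filenames):
    for start in range(0, len(all_filenames), CHUNK_SIZE):
        chunk = [read_text(name) for name in all_filenames[start:start + CHUNK_SIZE]]
        # One subprocess per chunk: when it exits, all of spaCy's memory
        # is returned to the operating system.
        p = multiprocessing.Process(target=process_docs, args=(chunk,))
        p.start()
        p.join()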
Full disclosure: I don't know why Spacy's memory creeps up over time. I've read everything I could find trying to get a simple answer, and every GitHub issue I've seen claims the problem has been fixed, yet I still see it happen when I run Spacy on an AWS Sagemaker instance.
Hope this helps someone! I know I spent hours wrestling with this.
Credit to another SO answer that explains subprocesses in Python in more detail.