I'm lemmatizing text documents with spaCy in Python. There are 500,000 documents, the largest of which is up to 20 MB of raw text.
The problem is this: spaCy's memory consumption grows over time until all of the memory is used up.
My hardware configuration: CPU: Intel i7-8700K 3.7 GHz (12 cores); memory: 16 GB; SSD: 1 TB; on-board GPU, not used for this task.
I'm using multiprocessing to divide the task across several processes (workers). Each worker receives a list of documents to process. The main process monitors the child processes. I load spaCy once in each child process and use that single spaCy instance for the worker's entire list of documents.
Memory tracing shows the following:
[ Memory trace - Top 10 ]
/opt/develop/virtualenv/lib/python3.6/site-packages/thinc/neural/mem.py:68: size=45.1 MiB, count=99, average=467 KiB
/opt/develop/virtualenv/lib/python3.6/posixpath.py:149: size=40.3 MiB, count=694225, average=61 B
:487: size=9550 KiB, count=77746, average=126 B
/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:33: size=7901 KiB, count=6, average=1317 KiB
/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_nouns.py:7114: size=5273 KiB, count=57494, average=94 B
prepare_docs04.py:372: size=4189 KiB, count=1, average=4189 KiB
/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:93: size=3949 KiB, count=5, average=790 KiB
/usr/lib/python3.6/json/decoder.py:355: size=1837 KiB, count=20456, average=92 B
/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_adjectives.py:2828: size=1704 KiB, count=20976, average=83 B
prepare_docs04.py:373: size=1633 KiB, count=1, average=1633 KiB
I saw a good suggestion to build a standalone server-client solution here: Is possible to keep spacy in memory to reduce the load time?
Is there a way to keep memory consumption under control with the multiprocessing approach?
Here is a simplified version of my code:
import gc, os, subprocess, spacy, sys, tracemalloc
from multiprocessing import Pipe, Process, Lock
from time import sleep
# START: memory trace
tracemalloc.start()
# Load spacy
spacyMorph = spacy.load("en_core_web_sm")
#
# Get word's lemma
#
def getLemma(word):
    global spacyMorph
    lemmaOutput = spacyMorph(str(word))
    return lemmaOutput
#
# Worker's logic
#
def workerNormalize(lock, conn, params):
    documentCount = 1
    for filenameRaw in params[1]:
        documentTotal = len(params[1])
        documentID = int(os.path.basename(filenameRaw).split('.')[0])

        # Send the worker's current progress to the main process
        statusMessage = "WORKING:{:d},{:d},".format(documentID, documentCount)
        documentCount += 1
        if lock is not None:
            lock.acquire()
            try:
                conn.send(statusMessage)
            finally:
                lock.release()
        else:
            print(statusMessage)

        # ----------------
        # Some code is excluded for clarity's sake
        # I've got a "wordList" from file "filenameRaw"
        # ----------------

        wordCount = 1
        wordTotalCount = len(wordList)
        for word in wordList:
            lemma = getLemma(word)
            wordCount += 1

        # ----------------
        # Then I collect all lemmas and save them to another text file
        # ----------------

        # Here I'm trying to reduce memory usage
        del wordList
        del word
        gc.collect()
if __name__ == '__main__':
    lock = Lock()
    processList = []

    # ----------------
    # Some code is excluded for clarity's sake
    # Here I'm getting the full list of files "fileTotalList" which I need to lemmatize
    # (the excluded code also initializes cursorStart, cursorEnd, stepSize, docTotalCount and count)
    # ----------------

    while cursorEnd < (docTotalCount + stepSize):
        fileList = fileTotalList[cursorStart:cursorEnd]

        # ----------------
        # Create workers and populate them with the list of files to process
        # ----------------
        processData = {}
        processData['total'] = len(fileList)  # worker's total progress
        processData['count'] = 0              # count of documents the worker has done
        processData['currentDocID'] = 0       # current document ID the worker is working on
        processData['comment'] = ''           # additional comment (optional)
        processData['con_parent'], processData['con_child'] = Pipe(duplex=False)
        processName = 'worker ' + str(count) + " at " + str(cursorStart)
        processData['handler'] = Process(target=workerNormalize, name=processName,
                                         args=(lock, processData['con_child'], [processName, fileList]))

        processList.append(processData)
        processData['handler'].start()

        cursorStart = cursorEnd
        cursorEnd += stepSize
        count += 1

    # ----------------
    # Run the monitor to look after the workers
    # ----------------
    while True:
        runningCount = 0

        # Worker communication format:
        # STATUS:COMMENTS
        # STATUS:
        # - WORKING - worker is working
        # - CLOSED  - worker has finished its job and closed the pipe connection
        # COMMENTS, for WORKING status:
        # DOCID,COUNT,COMMENTS
        # DOCID    - current document ID the worker is working on
        # COUNT    - count of done documents
        # COMMENTS - additional comments (optional)

        # ----------------
        # Run through the list of workers ...
        # ----------------
        for i, process in enumerate(processList):
            if process['handler'].is_alive():
                runningCount += 1

                # ----------------
                # ... and check if there is something in the PIPE
                # ----------------
                if process['con_parent'].poll():
                    try:
                        message = process['con_parent'].recv()
                        status = message.split(':')[0]
                        comment = message.split(':')[1]

                        # ----------------
                        # Some code is excluded for clarity's sake
                        # Update the worker's information and progress in "processList"
                        # ----------------
                    except EOFError:
                        print("EOF----")

                # ----------------
                # Some code is excluded for clarity's sake
                # Here I draw some progress lines per worker
                # ----------------
            else:
                # The worker has finished its job. Close the connection.
                process['con_parent'].close()

        # All workers have finished: stop monitoring
        if runningCount == 0:
            break

        # Wait for some time and monitor again
        sleep(PARAM['MONITOR_REFRESH_FREQUENCY'])

    print("================")
    print("**** DONE ! ****")
    print("================")

    # ----------------
    # Here I'm measuring memory usage to find the most "gluttonous" part of the code
    # ----------------
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')

    print("[ Memory trace - Top 10 ]")
    for stat in top_stats[:10]:
        print(stat)
Answer 0 (score: 1)
Memory problems when processing large amounts of data seem to be a known issue; there are several related GitHub issues about it.
Unfortunately, it doesn't look like there's a good solution yet.
Looking at your particular lemmatization task, I think your example code is a bit too simplified, because you're running the full spacy pipeline on individual words and then not doing anything with the results (not even inspecting the lemma?), so it's hard to tell what you actually want to do.
I'll assume you just want lemmas, so in general you want to disable the parts of the pipeline you're not using as much as possible (especially parsing if you're only lemmatizing; see https://spacy.io/usage/processing-pipelines#disabling) and use nlp.pipe to process documents in batches. Spacy can't handle very long documents if you're using the parser or entity recognizer, so you'll need to break up your texts somehow (or, for just lemmatization/tagging, you can increase nlp.max_length as much as you need).
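For example, a minimal sketch of that setup for a single long document, assuming en_core_web_sm as in the question (the file name and the 2,000,000-character limit are only illustrative):

import spacy

# Disable the components we don't need, so only tagging/lemmatization runs.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# Without the parser/NER it is safe to raise the document length limit;
# 2,000,000 characters is an arbitrary example value.
nlp.max_length = 2_000_000

with open("some_big_document.txt", encoding="utf8") as f:  # hypothetical input file
    doc = nlp(f.read())

lemmas = [token.lemma_ for token in doc]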
Breaking a document up into individual words, as in your example, defeats the purpose of most of spacy's analysis (you often can't meaningfully tag or parse single words), and calling spacy this way is also going to be very slow.
If you just need lemmas of common words out of context (where the tagger isn't going to provide any useful information), you can see whether the lookup lemmatizer is good enough for your task and skip the rest of the processing:
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LOOKUP
lemmatizer = Lemmatizer(lookup=LOOKUP)
print(lemmatizer(u"ducks", ''), lemmatizer(u"ducking", ''))
Output:
['duck'] ['duck']
This is just a static lookup table, so it won't do well on unknown words or on capitalization like "wugs" or "DUCKS", so you'll have to see whether it works well enough for your texts, but then it would be much, much faster without any memory leaks. (You can also just use the table on its own without spacy; as the import above shows, it's available as spacy.lang.en.LOOKUP.)
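A quick sketch of using the table on its own, assuming LOOKUP behaves like a plain dict mapping word forms to lemmas (as it does in the spaCy 2.x versions this snippet targets):

from spacy.lang.en import LOOKUP

def lookup_lemma(word):
    # Unknown forms (e.g. "wugs") and unexpected capitalization (e.g. "DUCKS")
    # simply come back unchanged.
    return LOOKUP.get(word, word)

print(lookup_lemma("ducks"), lookup_lemma("DUCKS"), lookup_lemma("wugs"))
# duck DUCKS wugs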
Otherwise, use something like the following to process texts in batches:
nlp = spacy.load('en', disable=['parser', 'ner'])
# if needed: nlp.max_length = MAX_DOC_LEN_IN_CHAR
for doc in nlp.pipe(texts):
    for token in doc:
        print(token.lemma_)
If you process one long text (or use nlp.pipe() for lots of shorter texts) instead of processing individual words, you should be able to tag/lemmatize (many) thousands of words per second in one thread.
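As a rough sketch of how the worker from the question could feed whole documents through nlp.pipe instead of calling spacy word by word (the file handling and the ".lemmas.txt" output path are just placeholders):

import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize_files(filenames):
    def read_texts():
        for name in filenames:
            with open(name, encoding="utf8") as f:
                yield f.read()

    # nlp.pipe streams the documents through the pipeline in batches.
    for name, doc in zip(filenames, nlp.pipe(read_texts(), batch_size=20)):
        lemmas = [token.lemma_ for token in doc]
        with open(name + ".lemmas.txt", "w", encoding="utf8") as out:
            out.write(" ".join(lemmas))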
Answer 1 (score: 0)
For people landing on this in the future, I found a hack that seems to work well:
import spacy
import en_core_web_lg
import multiprocessing
docs = ['Your documents']
def process_docs(docs, n_processes=None):
    # Load the model inside the subprocess,
    # as that seems to be the main culprit of the memory issues
    nlp = en_core_web_lg.load()

    if not n_processes:
        n_processes = multiprocessing.cpu_count()

    processed_docs = [doc for doc in nlp.pipe(docs, disable=['ner', 'parser'], n_process=n_processes)]

    # Then do what you wish beyond this point. I end up writing results out to s3.
    pass
for x in range(10):
    # This will spin up a subprocess,
    # and every time it finishes it will release all resources back to the machine.
    with multiprocessing.Manager() as manager:
        p = multiprocessing.Process(target=process_docs, args=(docs,))
        p.start()
        p.join()
The idea here is to put everything Spacy-related into a subprocess, so that all the memory is released once the subprocess finishes. I know it's working because I can actually see the memory being given back to the instance every time a subprocess finishes (and the instance no longer crashes xD).
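For instance, a sketch of what that loop might look like when working through the question's 500,000 files in fixed-size batches (the chunk size and the read_text helper are hypothetical; process_docs is the function from the snippet above):

import multiprocessing

CHUNK_SIZE = 10_000  # hypothetical batch size

def read_text(filename):
    with open(filename, encoding="utf8") as f:
        return f.read()

def run_in_batches(all_filenames):
    for start in range(0, len(all_filenames), CHUNK_SIZE):
        chunk = [read_text(name) for name in all_filenames[start:start + CHUNK_SIZE]]
        # One subprocess per chunk: when it exits, all of spaCy's memory
        # is returned to the operating system.
        p = multiprocessing.Process(target=process_docs, args=(chunk,))
        p.start()
        p.join()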
Full disclosure: I don't know why Spacy's memory creeps up over time. I've read everything I could find trying to get a simple answer, and every GitHub issue I've seen claims the problem has been fixed, yet I still see it happen when I run Spacy on an AWS Sagemaker instance.
Hope this helps someone! I know I spent hours wrestling with this.
Credit to another SO answer that explains subprocesses in Python in more detail.