I am trying to train a series of KMeans clustering models from Scikit-Learn in separate processes using Python's multiprocessing
library. When I try to train the models with multiprocessing.Pool
, the code runs without raising any runtime errors, but execution never completes.
Further investigation revealed that the code fails to terminate only when the memory size of the training data (X
in the snippet below) exceeds 2^16 = 65536 bytes. Below that size, the code behaves as expected.
import sys
import numpy as np
from multiprocessing import Pool
from sklearn.cluster import KMeans
# The Code Below Executes and Completes with MULTIPLIER = 227 but not when MULTIPLIER = 228
MULTIPLIER = 227
# Some Random Training Data
X = np.array(
    [[ 0.19276125, -0.05182922, -0.06014779,  0.06234482, -0.00727767, -0.05975948],
     [ 0.3541313,  -0.29502648,  0.3088767,   0.02438405, -0.01978588, -0.00060496],
     [ 0.22324295, -0.04291656, -0.0991894,   0.04455933, -0.00290042,  0.0316047 ],
     [ 0.30497936, -0.03115212, -0.26681659, -0.00742825,  0.00978793,  0.00555566],
     [ 0.1584528,  -0.01984878, -0.03908984, -0.03246589, -0.01520335, -0.02516451],
     [ 0.16888249, -0.04196552, -0.02432088, -0.02362059,  0.0353778,   0.02663082]]
    * MULTIPLIER)
# Prints 65488 when MULTIPLIER = 227 and 65776 when MULTIPLIER = 228
print("Input Data Size: ", sys.getsizeof(X))
# Training without Multiprocessing Always Works Regardless of the Size of X
no_multiprocessing = KMeans(n_clusters=2, n_jobs=1).fit(X)
print("Training without multiprocessing complete!") # Always prints
# Training with Multiprocessing Fails when X is too Large
def run_kmeans(X):
    return KMeans(n_clusters=2, n_jobs=1).fit(X)

with Pool(processes=1) as p:
    yes_multiprocessing = p.map(run_kmeans, [X])
    print("Training with multiprocessing complete!") # Doesn't print when MULTIPLIER = 228
I am always careful to set the n_jobs
parameter to 1
or None
so that my processes do not spawn processes of their own.
Strangely, this memory limit does not seem to be built into multiprocessing.Pool
as a "per-element" memory limit, because I can pass in a very long string (taking up more than 65536 bytes) and the code terminates without complaint:
import sys
from multiprocessing import Pool

my_string = "This sure is a silly string" * 2500
print("String size:", sys.getsizeof(my_string)) # Prints 79554

def add_exclamation(x):
    return x + "!"

with Pool(processes=1) as p:
    my_string = p.map(add_exclamation, [my_string])
    print("Multiprocessing Completed!") # Prints Just Fine
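One caveat when comparing these sizes: sys.getsizeof measures the in-memory Python object, while what Pool actually ships to the worker is the pickled payload, which can differ. A minimal stdlib-only sketch comparing the two measurements, reusing the string from the snippet above (the exact byte counts are illustrative and vary by Python version and pickle protocol):

```python
import pickle
import sys

my_string = "This sure is a silly string" * 2500

in_memory = sys.getsizeof(my_string)       # size of the Python object itself
on_the_wire = len(pickle.dumps(my_string)) # bytes Pool actually sends to the worker

print("In-memory size:", in_memory)
print("Pickled size:  ", on_the_wire)
```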
Terminating execution while the first snippet is hung always produces the following error message:
File "/path/to/my/code", line 29, in <module>
yes_multiprocessing = p.map(run_kmeans, [X])
File "/.../anaconda3/envs/Main36Env/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/.../anaconda3/envs/Main36Env/lib/python3.6/multiprocessing/pool.py", line 638, in get
self.wait(timeout)
File "/.../anaconda3/envs/Main36Env/lib/python3.6/multiprocessing/pool.py", line 635, in wait
self._event.wait(timeout)
File "/.../anaconda3/envs/Main36Env/lib/python3.6/threading.py", line 551, in wait
signaled = self._cond.wait(timeout)
File "/.../anaconda3/envs/Main36Env/lib/python3.6/threading.py", line 295, in wait
waiter.acquire()
KeyboardInterrupt
I have tried forcing macOS to spawn processes instead of forking them, as suggested here. I have looked into suggestions like making sure all relevant code exists within a with
block and avoiding an iPython environment (executing the Python code directly from the terminal), to no avail. Changing the number of Pool
processes also had no effect. I also tried switching from multiprocessing.Pool
to multiprocessing.Process
to keep a daemonic Pool
from trying to spawn processes from the KMeans joblib
integration, as described here, without success.
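For reference, forcing the spawn start method (one of the attempts described above) can be done with a multiprocessing context; a minimal sketch, with square as a stand-in worker function rather than the KMeans code from the question:

```python
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    # "spawn" starts a fresh interpreter per worker instead of forking,
    # which sidesteps fork-related issues on macOS.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=1) as p:
        results = p.map(square, [1, 2, 3])
    print(results)
```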
How can I train multiple KMeans models in separate processes when the training data exceeds 65536 bytes?
Answer 0 (score: 0)
After more trial and error, the problem appears to have been a broken environment: running the code above in a brand-new environment works fine. I am not entirely sure which package was causing the problem.
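A hedged sketch of what recreating the environment might look like with conda; the environment name, version pins, and script filename are illustrative, not taken from the original post:

```shell
# Create a clean environment with only the packages the snippet needs.
conda create -n kmeans-fresh python=3.6 numpy scikit-learn
conda activate kmeans-fresh
# Re-run the original snippet in the new environment.
python my_kmeans_script.py
```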