Looking for a simple example of Python multiprocessing
I am trying to find a working example of Python multiprocessing. I found an example that factors large numbers into primes. It works well because there is very little input (one large number per core) but a lot of computation (factoring the number into primes).
However, my interest is different: I have a lot of input data on which I perform a simple calculation. I wonder whether there is a simple way to modify the code below so that multiple cores genuinely beat a single core. I am running Python 3.6 on a Win10 machine with 4 physical cores and 16 GB of RAM.
Here is my example code:
import numpy as np
import multiprocessing as mp
import timeit
# comment the following line to get version without queue
queue = mp.Queue()
cores_no = 4
def npv_zcb(bnd_info, cores_no):
    bnds_no = len(bnd_info)
    npvs = []
    for bnd_idx in range(bnds_no):
        nom = bnd_info[bnd_idx][0]
        mat = bnd_info[bnd_idx][1]
        yld = bnd_info[bnd_idx][2]
        npvs.append(nom / ((1 + yld) ** mat))
    if cores_no == 1:
        return npvs
    # comment the following two lines to get version without queue
    else:
        queue.put(npvs)
# generate random attributes of zero coupon bonds
print('Generating random zero coupon bonds...')
bnds_no = 100
bnd_info = np.zeros([bnds_no, 3])
bnd_info[:, 0] = np.random.randint(1, 31, size=bnds_no)
bnd_info[:, 1] = np.random.randint(70, 151, size=bnds_no)
bnd_info[:, 2] = np.random.randint(0, 100, size=bnds_no) / 100
bnd_info = bnd_info.tolist()
# single core
print('Running single core...')
start = timeit.default_timer()
npvs = npv_zcb(bnd_info, 1)
print(' elapsed time: ', timeit.default_timer() - start, ' seconds')
# multiprocessing
print('Running multiprocessing...')
print(' ', cores_no, ' core(s)...')
start = timeit.default_timer()
processes = []
idx = list(range(0, bnds_no, int(bnds_no / cores_no)))
idx.append(bnds_no + 1)
for core_idx in range(cores_no):
    input_data = bnd_info[idx[core_idx]: idx[core_idx + 1]]
    process = mp.Process(target=npv_zcb,
                         args=(input_data, cores_no))
    processes.append(process)
    process.start()
for process_aux in processes:
    process_aux.join()
# comment the following three lines to get version without queue
mylist = []
while not queue.empty():
    mylist.append(queue.get())
print(' elapsed time: ', timeit.default_timer() - start, ' seconds')
I would be grateful if anyone could suggest how to modify the code so that the multi-core run beats the single-core run. I also noticed that increasing the variable bnds_no to 1,000 leads to a BrokenPipeError. One would expect that increasing the amount of input would lead to a longer computation time rather than an error... What is going on here?
Answer 0 (score: 1)
The BrokenPipeError is not caused by the larger input; it is caused by a race condition that arises from checking queue.empty() and calling queue.get() in separate steps.
You usually do not see it with smaller inputs because the queue items are processed very quickly and the race does not occur, but with larger data sets the chance of hitting the race condition increases.
Even with the smaller input, try running your script several times, perhaps 10 to 15 runs, and you will see the BrokenPipeError occur.
One solution is to pass a sentinel value into the queue, which you can use to test whether all the data in the queue has been processed.
Try modifying your code to something like this:
q = mp.Queue()
<put the data in the queue>
q.put(None)

while True:
    data = q.get()
    if data is not None:
        <process the data here>
    else:
        q.put(None)
        return
Answer 1 (score: 0)
This does not directly answer your question, but if you use RxPy for reactive programming in Python, you can take a look at their small example on multiprocessing: https://github.com/ReactiveX/RxPY/tree/release/v1.6.x#concurrency
Managing concurrency with ReactiveX/RxPy seems a bit easier than trying to do it manually.
Answer 2 (score: 0)
OK, so I removed the queue-related parts from the code to see whether that would get rid of the BrokenPipeError (above, I updated the original code to indicate what should be commented out). Unfortunately, it did not help.
I tested the code on my personal PC running Linux (Ubuntu 18.10, Python 3.6.7). Surprisingly, the code behaves differently on the two systems. On Linux, the version without the queue runs fine, while the version with the queue runs forever. On Windows there is no difference: I always end up with the BrokenPipeError.
PS: In another post (No multiprocessing print outputs (Spyder)) I found that there may be some issues with multiprocessing when using the Spyder editor. I ran into exactly the same problem on my Windows machine. So not every example from the official documentation works as expected...
Answer 3 (score: 0)
This does not answer your question; I am only posting it to illustrate what I said in the comments about when multiprocessing can speed things up.
In the code below, based on yours, I added a REPEAT constant that makes npv_zcb() repeat its computations many times, to simulate heavier CPU usage. Changing the value of this constant slows down or speeds up the single-core part far more than the multiprocessing part; in fact, it barely affects the multiprocessing part at all.
import numpy as np
import multiprocessing as mp
import timeit
np.random.seed(42) # Generate same set of random numbers for testing.
REPEAT = 10 # Number of times to repeat computations performed in npv_zcb.
def npv_zcb(bnd_info, queue):
    npvs = []
    for _ in range(REPEAT):  # To simulate more computations.
        for bnd_idx in range(len(bnd_info)):
            nom = bnd_info[bnd_idx][0]
            mat = bnd_info[bnd_idx][1]
            yld = bnd_info[bnd_idx][2]
            v = nom / ((1 + yld) ** mat)
            npvs.append(v)
    if queue:
        queue.put(npvs)
    else:
        return npvs

if __name__ == '__main__':
    print('Generating random zero coupon bonds...')
    print()

    bnds_no = 100
    cores_no = 4

    # generate random attributes of zero coupon bonds
    bnd_info = np.zeros([bnds_no, 3])
    bnd_info[:, 0] = np.random.randint(1, 31, size=bnds_no)
    bnd_info[:, 1] = np.random.randint(70, 151, size=bnds_no)
    bnd_info[:, 2] = np.random.randint(0, 100, size=bnds_no) / 100
    bnd_info = bnd_info.tolist()

    # single core
    print('Running single core...')
    start = timeit.default_timer()
    npvs = npv_zcb(bnd_info, None)
    print(' elapsed time: {:.6f} seconds'.format(timeit.default_timer() - start))

    # multiprocessing
    print()
    print('Running multiprocessing...')
    print(' ', cores_no, ' core(s)...')
    start = timeit.default_timer()

    queue = mp.Queue()
    processes = []
    idx = list(range(0, bnds_no, int(bnds_no / cores_no)))
    idx.append(bnds_no + 1)
    for core_idx in range(cores_no):
        input_data = bnd_info[idx[core_idx]: idx[core_idx + 1]]
        process = mp.Process(target=npv_zcb, args=(input_data, queue))
        processes.append(process)
        process.start()
    for process in processes:
        process.join()

    mylist = []
    while not queue.empty():
        mylist.extend(queue.get())

    print(' elapsed time: {:.6f} seconds'.format(timeit.default_timer() - start))
Answer 4 (score: 0)
OK, so I finally found the answer: multiprocessing does not work the same way on Windows. The following code runs fine on Ubuntu (Ubuntu 19.04 & Python 3.7), but not on Windows (Win10 & Python 3.6). I hope it helps someone else...
import pandas as pd
import numpy as np
import csv
import multiprocessing as mp
import timeit
def npv_zcb(bnd_file, delimiter=','):
    """
    Michal Mackanic
    06/05/2019 v1.0

    Load bond positions from a .csv file, value the bonds and save results
    back to a .csv file.

    inputs:
        bnd_file: str
            full path to a .csv file with bond positions
        delimiter: str
            delimiter to be used in .csv file
    outputs:
        a .csv file with additional field npv.
    dependencies:
    example:
        npv_zcb('C:\\temp\\bnd_aux.csv', ',')
    """

    # load the input file as a dataframe
    bnd_info = pd.read_csv(bnd_file,
                           sep=delimiter,
                           quoting=2,  # csv.QUOTE_NONNUMERIC
                           doublequote=True,
                           low_memory=False)

    # convert the dataframe into a list of dictionaries
    bnd_info = bnd_info.to_dict(orient='records')

    # get the number of bonds in the file
    bnds_no = len(bnd_info)

    # go bond by bond
    for bnd_idx in range(bnds_no):
        mat = bnd_info[bnd_idx]['maturity']
        nom = bnd_info[bnd_idx]['nominal']
        yld = bnd_info[bnd_idx]['yld']
        bnd_info[bnd_idx]['npv'] = nom / ((1 + yld) ** mat)

    # convert the list of dictionaries back to a dataframe and save it
    # as a .csv file
    bnd_info = pd.DataFrame(bnd_info)
    bnd_info.to_csv(bnd_file,
                    sep=delimiter,
                    quoting=csv.QUOTE_NONNUMERIC,
                    quotechar='"',
                    index=False)

    return

def main(cores_no, bnds_no, path, delimiter):
    # generate random attributes of zero coupon bonds
    print('Generating random zero coupon bonds...')
    bnd_info = np.zeros([bnds_no, 3])
    bnd_info[:, 0] = np.random.randint(1, 31, size=bnds_no)
    bnd_info[:, 1] = np.random.randint(70, 151, size=bnds_no)
    bnd_info[:, 2] = np.random.randint(0, 100, size=bnds_no) / 100
    bnd_info = zip(bnd_info[:, 0], bnd_info[:, 1], bnd_info[:, 2])
    bnd_info = [{'maturity': mat,
                 'nominal': nom,
                 'yld': yld} for mat, nom, yld in bnd_info]
    bnd_info = pd.DataFrame(bnd_info)

    # save bond positions into a .csv file
    bnd_info.to_csv(path + 'bnd_aux.csv',
                    sep=delimiter,
                    quoting=csv.QUOTE_NONNUMERIC,
                    quotechar='"',
                    index=False)

    # prepare one .csv file per core
    print('Preparing input files...')
    idx = list(range(0, bnds_no, int(bnds_no / cores_no)))
    idx.append(bnds_no + 1)
    for core_idx in range(cores_no):
        # save bond positions into a .csv file
        file_name = path + 'bnd_aux_' + str(core_idx) + '.csv'
        bnd_info_aux = bnd_info[idx[core_idx]: idx[core_idx + 1]]
        bnd_info_aux.to_csv(file_name,
                            sep=delimiter,
                            quoting=csv.QUOTE_NONNUMERIC,
                            quotechar='"',
                            index=False)

    # SINGLE CORE
    print('Running single core...')
    start = timeit.default_timer()

    # evaluate bond positions
    npv_zcb(path + 'bnd_aux.csv', delimiter)

    print(' elapsed time: ', timeit.default_timer() - start, ' seconds')

    # MULTIPLE CORES
    if __name__ == '__main__':
        # spread calculation among several cores
        print('Running multiprocessing...')
        print(' ', cores_no, ' core(s)...')
        start = timeit.default_timer()
        processes = []

        # go core by core
        print('   spreading calculation among processes...')
        for core_idx in range(cores_no):
            # run calculations
            file_name = path + 'bnd_aux_' + str(core_idx) + '.csv'
            process = mp.Process(target=npv_zcb,
                                 args=(file_name, delimiter))
            processes.append(process)
            process.start()

        # wait till every process is finished
        print('   waiting for all processes to finish...')
        for process in processes:
            process.join()

        print(' elapsed time: ', timeit.default_timer() - start, ' seconds')

main(cores_no=2,
     bnds_no=1000000,
     path='/home/macky/',
     delimiter=',')
Answer 5 (score: 0)
With the help of a colleague I was able to write simple code that actually behaves as expected. I was nearly there; my code needed a few subtle but crucial modifications. To run the code, open an Anaconda prompt, type python -m idlelib, open the file and run it.
import pandas as pd
import numpy as np
import csv
import multiprocessing as mp
import timeit
def npv_zcb(core_idx, bnd_file, delimiter=','):
    """
    Michal Mackanic
    06/05/2019 v1.0

    Load bond positions from a .csv file, value the bonds and save results
    back to a .csv file.

    inputs:
        bnd_file: str
            full path to a .csv file with bond positions
        delimiter: str
            delimiter to be used in .csv file
    outputs:
        a .csv file with additional field npv.
    dependencies:
    example:
        npv_zcb(1, 'C:\\temp\\bnd_aux.csv', ',')
    """

    # core idx
    print('   npv_zcb() starting on core ' + str(core_idx))

    # load the input file as a dataframe
    bnd_info = pd.read_csv(bnd_file,
                           sep=delimiter,
                           quoting=2,  # csv.QUOTE_NONNUMERIC
                           header=0,
                           doublequote=True,
                           low_memory=False)

    # convert the dataframe into a list of dictionaries
    bnd_info = bnd_info.to_dict(orient='records')

    # get the number of bonds in the file
    bnds_no = len(bnd_info)

    # go bond by bond
    for bnd_idx in range(bnds_no):
        mat = bnd_info[bnd_idx]['maturity']
        nom = bnd_info[bnd_idx]['nominal']
        yld = bnd_info[bnd_idx]['yld']
        bnd_info[bnd_idx]['npv'] = nom / ((1 + yld) ** mat)

    # convert the list of dictionaries back to a dataframe and save it
    # as a .csv file
    bnd_info = pd.DataFrame(bnd_info)
    bnd_info.to_csv(bnd_file,
                    sep=delimiter,
                    quoting=csv.QUOTE_NONNUMERIC,
                    quotechar='"',
                    index=False)

    # core idx
    print('   npv_zcb() finished on core ' + str(core_idx))

    # everything OK
    return True

def main(cores_no, bnds_no, path, delimiter):
    if __name__ == '__main__':
        mp.freeze_support()

        # generate random attributes of zero coupon bonds
        print('Generating random zero coupon bonds...')
        bnd_info = np.zeros([bnds_no, 3])
        bnd_info[:, 0] = np.random.randint(1, 31, size=bnds_no)
        bnd_info[:, 1] = np.random.randint(70, 151, size=bnds_no)
        bnd_info[:, 2] = np.random.randint(0, 100, size=bnds_no) / 100
        bnd_info = zip(bnd_info[:, 0], bnd_info[:, 1], bnd_info[:, 2])
        bnd_info = [{'maturity': mat,
                     'nominal': nom,
                     'yld': yld} for mat, nom, yld in bnd_info]
        bnd_info = pd.DataFrame(bnd_info)

        # save bond positions into a .csv file
        bnd_info.to_csv(path + 'bnd_aux.csv',
                        sep=delimiter,
                        quoting=csv.QUOTE_NONNUMERIC,
                        quotechar='"',
                        index=False)

        # prepare one .csv file per core
        print('Preparing input files...')
        idx = list(range(0, bnds_no, int(bnds_no / cores_no)))
        idx.append(bnds_no + 1)
        for core_idx in range(cores_no):
            # save bond positions into a .csv file
            file_name = path + 'bnd_aux_' + str(core_idx) + '.csv'
            bnd_info_aux = bnd_info[idx[core_idx]: idx[core_idx + 1]]
            bnd_info_aux.to_csv(file_name,
                                sep=delimiter,
                                quoting=csv.QUOTE_NONNUMERIC,
                                quotechar='"',
                                index=False)

        # SINGLE CORE
        print('Running single core...')
        start = timeit.default_timer()

        # evaluate bond positions
        npv_zcb(1, path + 'bnd_aux.csv', delimiter)

        print(' elapsed time: ', timeit.default_timer() - start, ' seconds')

        # MULTIPLE CORES
        # spread calculation among several cores
        print('Running multiprocessing...')
        print(' ', cores_no, ' core(s)...')
        start = timeit.default_timer()
        processes = []

        # go core by core
        print('   spreading calculation among processes...')
        for core_idx in range(cores_no):
            # run calculations
            file_name = path + 'bnd_aux_' + str(core_idx) + '.csv'
            process = mp.Process(target=npv_zcb,
                                 args=(core_idx, file_name, delimiter))
            processes.append(process)
            process.start()

        # wait till every process is finished
        print('   waiting for all processes to finish...')
        for process in processes:
            process.join()

        print(' elapsed time: ', timeit.default_timer() - start, ' seconds')

main(cores_no=2,
     bnds_no=1000000,
     path='C:\\temp\\',
     delimiter=',')