The code below simulates a producer/consumer model that collects data from the forex broker FXCM and writes it to a database. Each producer process establishes a session-based connection with the broker.
Both producers and consumers run indefinitely until a "poison pill" is placed on the queue, which happens at the close of business (Friday 22:00). I have omitted that part of the code because it is not relevant to the question. All the examples I have found seem to spawn a process, do a small amount of work, and then join() back into the parent process, like this one here.
As mentioned above, the producers run indefinitely; the reason is that logging in and creating a session with the broker takes about three seconds.
When running the code below, you will see the queue backlog, although it seems to be even worse when running the actual code.
Not sure whether it is relevant, but the session is created with the python-forexconnect API, which is written in C++ and uses Boost.
The problem is that the consumers take too long to get() items from the queue, and I would like to know whether this model is the right way to approach this type of development.
Thank you.
Sample code:
from multiprocessing import Process, Queue, cpu_count
from datetime import datetime, timedelta
import numpy as np
import time
def dummy_data(dtto):
    # Build 300 days of fake price rows (date + 5 random columns)
    # ending at dtto, newest first
    dates = np.array(
        [dtto - timedelta(days=i) for i in range(300)])
    price_data = np.random.rand(len(dates), 5)
    return np.concatenate(
        (np.vstack(dates), price_data), axis=1)
def get_bars(q2, session, symbol, dtfm, dtto, time_frame):
    # Page backwards through the history, one 300-bar chunk at a
    # time, until the start date is reached (session is unused here)
    stop_date = dtfm
    while dtto > stop_date:
        data = dummy_data(dtto)
        dtfm = data[-1, 0]
        dtto = data[0, 0]
        q2.put((symbol, dtfm, dtto))
        # Move the window back and fetch the next chunk
        dtto = dtfm
def producer(q1, q2):
    # client = fx.Client(....)
    client = 'broker session'  # placeholder for the ~3 s broker login
    while True:
        job = q1.get()
        if job is None:  # poison pill
            break
        sym, dtfm, dtto, tf = job
        # Get price data from broker
        get_bars(q2, client, sym, dtfm, dtto, tf)
    q2.put(None)
def consumer(q2):
    while True:
        bars = q2.get()
        if bars is None:  # poison pill
            break
        print(q2.qsize(), bars[0], bars[1], bars[2])  # write to db
if __name__ == '__main__':
    q1, q2 = Queue(), Queue()
    # instruments = client.get_offers()
    # instruments = ['GBP/USD', 'EUR/USD', ...]
    instruments = range(62)  # 62 dummy instruments

    # Place a job on the queue for each symbol
    for symbol in instruments:
        q1.put((symbol,
                datetime(2000, 1, 14, 22, 0),
                datetime(2018, 1, 14, 22, 0),
                'D1'))

    # Set up producers and consumers
    pp, cp = range(6), range(2)
    pro = [Process(target=producer, args=(q1, q2)) for i in pp]
    con = [Process(target=consumer, args=(q2,)) for i in cp]
    for p in pro: p.start()
    for p in con: p.start()

    # This is just here to stop this script and does not
    # exist in the real version
    for i in pp: q1.put(None)
    for p in pro: p.join()
    for p in con: p.join()
    print('stopped')
Answer 0 (score: 1)
Horrible performance of multiprocessing.Queue.get() is a known problem (there are several questions about it on Stack Overflow as well, but no generally useful answers).
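If you want to put a number on it, here is a minimal timing sketch of my own (not your workload: small tuples comparable to your bar records; the per-item figure varies widely by machine and platform):

from multiprocessing import Process, Queue
import time

def drain(q, n):
    # Time n consecutive get() calls in the child process
    t0 = time.time()
    for _ in range(n):
        q.get()
    dt = time.time() - t0
    print('%d gets in %.2fs (%.1f us/item)' % (n, dt, 1e6 * dt / n))

if __name__ == '__main__':
    n = 100000
    q = Queue()
    p = Process(target=drain, args=(q, n))
    p.start()
    for i in range(n):
        q.put((i, 'GBP/USD', 1.3456))  # small tuple, like a bar record
    p.join()

Every put()/get() pair pays for pickling plus a trip through an OS pipe, which is why a queue of many small items backs up.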
That rather suggests you should consider another model. You could measure how process-creation overhead compares to this: do not use permanently running processes at all, but launch a process as soon as you have data ready for it. Done this way, the subprocess receives an in-memory copy of the data when your process forks. This adds process-creation overhead but removes the queue. You could at least consider this for your consumer, which writes to the database and does not need to report anything back to the parent.
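A minimal sketch of that idea, with assumptions flagged: write_to_db and the bar data are placeholders rather than FXCM API calls, and the "in-memory copy" relies on the fork start method (Unix; Windows spawns new interpreters and would pickle the arguments instead):

from multiprocessing import Process

def write_to_db(symbol, bars):
    # Placeholder consumer: with the fork start method the child
    # inherits `bars` from the parent's memory, so nothing is
    # serialized or pushed through a queue.
    pass  # e.g. INSERT the rows into the price table

if __name__ == '__main__':
    children = []
    for symbol in ['GBP/USD', 'EUR/USD']:        # stand-ins for the offers
        bars = [(symbol, '2018-01-12', 1.3456)]  # stand-in for fetched data
        # One short-lived writer per batch instead of a shared queue
        p = Process(target=write_to_db, args=(symbol, bars))
        p.start()
        children.append(p)
    for p in children:
        p.join()

In the real code you would cap the number of live children (a multiprocessing.Pool, or a semaphore) so that 62 instruments do not fork 62 writers at once.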
Python is a great language but it is not the best performing when it comes to parallel processing.