The task is to take records from an input file, process them and store them in an SQLite database. A file can contain millions of records; processing a single record is very fast, but I hoped to get some additional speedup from multiprocessing. I implemented it and found that there is a bottleneck somewhere, because the gain is not that large.
I cannot make effective use of all cores. Three worker processes give a noticeable effect; adding more does not help.
Below I provide simplified sample code, only to illustrate how the processes are created and managed.
After some investigation, my suspects are:
reading the data from disk
serialization/deserialization - the least suspicious
data passed to the processes (see the sketch after this list)
locks; I have two of them: lock_dict for the shared dict and lock_db for the database inserts (see the code below)
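To check the serialization and queue-transfer suspects in isolation, one option is to push serialized records through a Manager queue into a worker that does nothing else and time that alone. This is only a diagnostic sketch; the payload size, record count and queue size below are made-up placeholders to be adjusted to realistic values:

import time
from multiprocessing import Manager, Process

def drain(q):
    # consumer that only receives items: no unserialization, no calculation
    while True:
        if q.get() is None:
            break

if __name__ == "__main__":
    manager = Manager()
    q = manager.Queue(8)                  # placeholder queue size
    p = Process(target=drain, args=(q,))
    p.start()
    payload = b"x" * 500                  # stand-in for mol.serialize() output
    n = 100000                            # placeholder record count
    t0 = time.perf_counter()
    for i in range(n):
        q.put((payload, "mol_%d" % i))
    q.put(None)
    p.join()
    dt = time.perf_counter() - t0
    print("queue transfer only: %.1f s (%.0f records/s)" % (dt, n / dt))

If this throughput is close to the records/s of the full pipeline, the queue itself (a proxy object served by the Manager process) would be a likely limit.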
What is not the bottleneck:
I profiled the code in a single process with cProfile. It was not very helpful: most of the time is spent in the calculation stage.
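Since a single-process cProfile run cannot show queue waits or lock waits inside the workers, a possible next step is to profile each worker separately and dump one stats file per process. profiled_worker below is a hypothetical wrapper around the existing process() function, not part of the original code; each worker would be started with target=profiled_worker instead of target=process:

import cProfile
import os

def profiled_worker(*args):
    # run the existing worker under cProfile and write one stats file per process
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        process(*args)   # the unchanged worker function from the sample code below
    finally:
        profiler.disable()
        profiler.dump_stats("worker_%d.prof" % os.getpid())

The resulting worker_<pid>.prof files can be browsed with python -m pstats worker_<pid>.prof; queue and lock waits should then show up as cumulative time inside the proxy get/acquire calls.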
Execution times measured on a small subset of the data:
# (laptop, 2 cores with hyper-threading, Python 3.5, Ubuntu 16.04, SSD)
serial (old implementation): 28s
parallel (workers = 1): 28s
parallel (workers = 2): 19s
parallel (workers = 3): 17s
parallel (workers = 4): 17s
# (virtual machine on a server, 30 cores, Python 3.4, Ubuntu 14.04, HDD)
parallel (workers = 1): 28s
parallel (workers = 2): 11s
parallel (workers = 3): 10s
parallel (workers = 4): 8s
parallel (workers = 5): 8s
parallel (workers = 6): 8s
Q: How can I identify the bottleneck, or at least some suspects? Is it possible to get more than a 4x gain?
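One cheap first check for the disk-reading suspect (and for the serialization half) is to time the producer side alone, i.e. read and serialize every record without any workers or queue. This sketch reuses file_name and the indigo calls exactly as they appear in the sample code below:

import time

t0 = time.perf_counter()
n = 0
for mol in indigo.iterateSDFile(file_name):
    payload = mol.serialize()    # same call the producer makes before q.put()
    n += 1
dt = time.perf_counter() - t0
print("read + serialize only: %d records in %.1f s" % (n, dt))

If this alone already takes a large fraction of the 8-28 s measured above, the single producer process would be the ceiling no matter how many workers are added.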
import sqlite3 as lite
from multiprocessing import Process, Manager, cpu_count

# indigo is an external module; calc_value, insert_data, create_tables,
# insert_db, ncpu, out_fname and file_name are omitted from this simplified example

def process(q, conn, cursor, d, lock_db, lock_dict):
    data_collector = []
    while True:
        data = q.get()
        if data is None:                       # poison pill: no more records
            break
        mol_name = data[1]
        mol = indigo.unserialize(data[0])      # <-- unserialization
        lock_dict.acquire()
        value = d.get(mol_name, None)
        if value is None:
            value = calc_value(mol)
            d[mol_name] = value
        lock_dict.release()
        # some calculations which return several variables A, B and C
        data_collector.append([mol_name, A, B, C])
        if len(data_collector) == 1000:
            insert_data(conn, cursor, data_collector, lock_db)
            data_collector = []
    insert_data(conn, cursor, data_collector, lock_db)   # flush the remainder
with lite.connect(out_fname) as conn:
    cur = conn.cursor()
    create_tables(cur)
    nprocess = max(min(ncpu, cpu_count()), 1)
    manager = Manager()
    lock_db = manager.Lock()
    lock_dict = manager.Lock()
    q = manager.Queue(2 * nprocess)
    d = manager.dict()
    pool = []
    for i in range(nprocess):
        p = Process(target=process, args=(q, conn, cur, d, lock_db, lock_dict))
        p.start()
        pool.append(p)
    for i, mol in enumerate(indigo.iterateSDFile(file_name)):
        q.put((mol.serialize(), mol.name()))   # <-- serialization
    for _ in range(nprocess):                  # one poison pill per worker
        q.put(None)
    for p in pool:
        p.join()
    for k, v in d.items():
        insert_db(cur, k, v)
    conn.commit()
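To attribute time inside the worker loop above without a profiler, another option is to accumulate wall-clock time per stage (queue wait, unserialization, lock wait, calculation, DB insert) and print the totals when a worker exits. timed() is an illustrative helper, not part of the original code:

import os
import time
from collections import defaultdict
from contextlib import contextmanager

stage_totals = defaultdict(float)

@contextmanager
def timed(stage):
    # accumulate wall-clock time spent in a named stage
    t0 = time.perf_counter()
    try:
        yield
    finally:
        stage_totals[stage] += time.perf_counter() - t0

# inside the worker loop, wrap each suspect stage, for example:
#     with timed("queue_get"):
#         data = q.get()
#     with timed("unserialize"):
#         mol = indigo.unserialize(data[0])
#     with timed("lock_dict_wait"):
#         lock_dict.acquire()
# and print the per-process totals just before the worker returns:
#     print(os.getpid(), dict(stage_totals))

Comparing the per-stage totals across workers, and against the total run time, should show whether the workers mostly wait on the queue, wait on the locks, or actually compute.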