I have a database of 10,000 adam_ids. For each adam_id, I need to pull down information via an API.
My table looks like this:
`title`
- adam_id
- success (boolean)
- number_of_tries (# of failed pull-down attempts so far, i.e. times the call ended with success=0)
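For reference, here is roughly the schema as a runnable sketch (I'm showing it with sqlite3 purely for illustration; the column types are approximate):

    import sqlite3

    # approximate schema sketch; the real table may live in MySQL etc.
    conn = sqlite3.connect("titles.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS title (
                        adam_id INTEGER PRIMARY KEY,
                        success INTEGER DEFAULT 0,          -- 1 once the pull down succeeded
                        number_of_tries INTEGER DEFAULT 0   -- failed attempts so far
                    )""")
    conn.commit()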
Here is the function I want to create:
def pull_down(cursor):
    work_remains = True
    while work_remains:
        cursor.execute("""SELECT adam_id FROM title WHERE success=0
                          AND number_of_tries < 5 ORDER BY adam_id LIMIT 1""")
        row = cursor.fetchone()  # fetchall() here would consume the row before fetchone() could see it
        if row:
            do_api_call(cursor, row[0])
        else:
            work_remains = False
def do_api_call(cursor, adam_id):
    # do api call
    if success:
        cursor.execute("UPDATE title SET success=1 WHERE adam_id=%s", (adam_id,))
    else:
        cursor.execute("UPDATE title SET number_of_tries = number_of_tries + 1 WHERE adam_id=%s", (adam_id,))
How can I do the above with n workers using Python's multiprocessing, instead of as a single synchronous process? I have started looking at the multiprocessing module (http://docs.python.org/library/multiprocessing.html), but so far I'm finding it hard to digest.
Answer 0 (score: 1)
If the important part of the work is the API call, since it goes out to an external resource, then that is the only part really worth parallelizing. The database calls are likely very fast by comparison. So you could try this:

1. Pull down the full list of adam_id values in a single query
2. Fan the API calls out to a pool of worker processes
3. Write the results back to the database in batch

Here is a rough pseudocode example to show the logical flow:
from multiprocessing import Pool

def pull_down(cursor):
    # get all the data in one query
    count = cursor.execute("""SELECT adam_id FROM title WHERE success=0
                              AND number_of_tries < 5 ORDER BY adam_id""")
    if count:
        # Step #1: build the full list of ids still needing a pull
        adam_id_list = [row[0] for row in cursor.fetchall()]
        # Step #2: fan the api calls out to 4 worker processes
        pool = Pool(4)
        results = pool.map(do_api_call, adam_id_list)
        pool.close()
        # Step #3: write all the results back in batch
        update_db(results)
def do_api_call(adam_id):
    # do api call
    success = call_api_with_id(adam_id)
    return (adam_id, success)
def update_db(results):
    # loop over results and build batch queries for the succeeded
    # or failed items
    # (obviously this split up could be optimized)
    succeeded = [result[0] for result in results if result[1]]
    failed = [result[0] for result in results if not result[1]]
    submit_success(succeeded)
    submit_failed(failed)
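submit_success and submit_failed are left undefined above; here is a minimal sketch of what they could look like, assuming a MySQLdb-style cursor with %s placeholders (sqlite3 would use ? instead). Note I pass the cursor in explicitly, so update_db would need to do the same:

    def submit_success(cursor, succeeded):
        # one batched statement marking every successfully pulled id
        cursor.executemany("UPDATE title SET success=1 WHERE adam_id=%s",
                           [(adam_id,) for adam_id in succeeded])

    def submit_failed(cursor, failed):
        # one batched statement bumping the retry counter for each failed id
        cursor.executemany(
            "UPDATE title SET number_of_tries = number_of_tries + 1 WHERE adam_id=%s",
            [(adam_id,) for adam_id in failed])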
If you tried to make the database calls parallel as well, it would only complicate the code, because then you would have to correctly give each process its own connection, when in reality it is not the database that is slowing you down.
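For completeness, if you did want the workers themselves to touch the database, the usual pattern is a Pool initializer that opens one connection per child process. Here is a rough sketch of that extra plumbing, using sqlite3 as a stand-in and made-up names (init_worker, check_one):

    from multiprocessing import Pool
    import sqlite3  # stand-in here; the same pattern applies to MySQLdb etc.

    _conn = None  # each worker process gets its own module-level connection

    def init_worker(db_path):
        # runs once inside every child process, so no connection is shared
        global _conn
        _conn = sqlite3.connect(db_path)

    def check_one(adam_id):
        # hypothetical per-worker query, just to show the connection in use
        cur = _conn.cursor()
        cur.execute("SELECT success FROM title WHERE adam_id=?", (adam_id,))
        return cur.fetchone()

    if __name__ == "__main__":
        pool = Pool(4, initializer=init_worker, initargs=("titles.db",))
        print(pool.map(check_one, [1, 2, 3]))
        pool.close()
        pool.join()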