Distributing a workload in Python

Asked: 2012-08-23 23:49:06

Tags: python performance multiprocessing

I have a database of 10,000 adam_ids. For each adam_id, I need to pull down information through an API.

My table is as follows:

`title`
- adam_id
- success (boolean)
- number_of_tries (# of times the pull down has been attempted with success=0)
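
For concreteness, a minimal sketch of that schema, assuming MySQL via the MySQLdb driver (the column types and connection parameters are assumptions; only the column names are given above):

    import MySQLdb  # assumed driver

    conn = MySQLdb.connect(db="adam")  # hypothetical connection parameters
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS title (
            adam_id         INT PRIMARY KEY,                -- assumed type
            success         TINYINT(1) NOT NULL DEFAULT 0,  -- boolean flag
            number_of_tries INT NOT NULL DEFAULT 0          -- failed attempts so far
        )
    """)
    conn.commit()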

Here is the function I want to create:

def pull_down(cursor):
    work_remains = True
    while work_remains:
        cursor.execute("""SELECT adam_id FROM title WHERE success=0
                          AND number_of_tries < 5 ORDER BY adam_id LIMIT 1""")
        row = cursor.fetchone()
        if row:
            adam_id = row[0]
            do_api_call(adam_id)
        else:
            work_remains = False

def do_api_call(adam_id):
    # do api call
    if success:
        cursor.execute("UPDATE title SET success=1 WHERE adam_id = %s",
                       (adam_id,))
    else:
        cursor.execute("""UPDATE title SET number_of_tries = number_of_tries + 1
                          WHERE adam_id = %s""", (adam_id,))

How can I do the above with n workers using Python's multiprocessing instead of one synchronous process? I have started looking at the multiprocessing module (http://docs.python.org/library/multiprocessing.html), but so far I am finding it hard to digest.

1 Answer:

Answer 0 (score: 1):

If the significant part of the work is the API call, since it goes out to an external resource, then that is the only part you actually want to parallelize. The database calls are likely very fast. So you could try this:

  1. Fetch the adam_ids in one batch query
  2. Hand the IDs to a pool of worker processes to do the API calls
  3. Take the results and commit them back to the database

Here is a rough pseudocode example to show the logical flow:

    from multiprocessing import Pool
    
    def pull_down(cursor):
        # Step #1: get all the pending IDs in one query
        count = cursor.execute("""SELECT adam_id FROM title WHERE success=0
                          AND number_of_tries < 5 ORDER BY adam_id""")
        if count:
            adam_id_list = [row[0] for row in cursor.fetchall()]
    
            # Step #2: farm the API calls out to a pool of worker processes
            pool = Pool(4)
            results = pool.map(do_api_call, adam_id_list)
            pool.close()
            pool.join()
    
            # Step #3: commit the results back to the database
            update_db(results)
    
    def do_api_call(adam_id):
        # do api call
        success = call_api_with_id(adam_id)
        return (adam_id, success)
    
    def update_db(results):
        # loop over results and build batch queries for the succeeded
        # or failed items
    
        # (obviously this split up could be optimized)
        succeeded = [result[0] for result in results if result[1]]
        failed = [result[0] for result in results if not result[1]]
    
        submit_success(succeeded)
        submit_failed(failed)
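
The submit helpers are not shown above; a minimal sketch of what they might look like, assuming a DB-API cursor with %s placeholders (MySQLdb-style) and the same module-level cursor the rest of the pseudocode relies on:

    def submit_success(succeeded):
        # Batch-update every ID whose API call succeeded in one round trip.
        cursor.executemany("UPDATE title SET success=1 WHERE adam_id = %s",
                           [(adam_id,) for adam_id in succeeded])

    def submit_failed(failed):
        # Bump the retry counter for every ID whose API call failed.
        cursor.executemany("""UPDATE title SET number_of_tries = number_of_tries + 1
                              WHERE adam_id = %s""",
                           [(adam_id,) for adam_id in failed])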
    

If you tried to make the database calls parallel as well, it would just complicate the code, because then you would have to correctly give each process its own connection, when in reality it isn't the database that is slowing you down.
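
That said, if each worker really did need to talk to the database, the usual pattern is a Pool initializer that opens one connection per child process. A hypothetical sketch (the driver, connection parameters, and combined worker function are assumptions, not part of the answer above):

    from multiprocessing import Pool

    import MySQLdb  # assumed driver

    worker_conn = None  # set separately inside each worker process

    def init_worker():
        # Runs once in every child process, so each worker gets its own
        # connection instead of sharing the parent's.
        global worker_conn
        worker_conn = MySQLdb.connect(db="adam")  # hypothetical parameters

    def do_api_call_and_update(adam_id):
        # Worker does the API call AND its own database write.
        success = call_api_with_id(adam_id)
        cur = worker_conn.cursor()
        if success:
            cur.execute("UPDATE title SET success=1 WHERE adam_id = %s",
                        (adam_id,))
        else:
            cur.execute("""UPDATE title SET number_of_tries = number_of_tries + 1
                           WHERE adam_id = %s""", (adam_id,))
        worker_conn.commit()

    pool = Pool(4, initializer=init_worker)
    pool.map(do_api_call_and_update, adam_id_list)
    pool.close()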