Distributing a workload in Python

Asked: 2012-08-23 23:49:06

Tags: python performance multiprocessing

I have a database of 10,000 adam_ids. For each adam_id, I need to pull down information through an API.

My table is as follows:

`title`
- adam_id
- success (boolean)
- number_of_tries (# of times the pull down has been attempted with success=0)
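
For concreteness, a minimal sketch of that schema, assuming MySQL via the MySQLdb driver (the column types and connection parameters are assumptions; only the column names are given above):

    import MySQLdb  # assumed driver

    conn = MySQLdb.connect(db="adam")  # hypothetical connection parameters
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS title (
            adam_id         INT PRIMARY KEY,                -- assumed type
            success         TINYINT(1) NOT NULL DEFAULT 0,  -- boolean flag
            number_of_tries INT NOT NULL DEFAULT 0          -- failed attempts so far
        )
    """)
    conn.commit()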

Here is the function I want to create:

def pull_down(cursor):
    work_remains = True
    while work_remains:
        cursor.execute("""SELECT adam_id FROM title WHERE success=0
                          AND number_of_tries < 5 ORDER BY adam_id LIMIT 1""")
        row = cursor.fetchone()
        if row:
            adam_id = row[0]
            do_api_call(adam_id)
        else:
            work_remains = False

def do_api_call(adam_id):
    # do api call
    if success:
        cursor.execute("UPDATE title SET success=1 WHERE adam_id = %s",
                       (adam_id,))
    else:
        cursor.execute("""UPDATE title SET number_of_tries = number_of_tries + 1
                          WHERE adam_id = %s""", (adam_id,))

How can I do the above with n workers using Python's multiprocessing instead of one synchronous process? I have started looking at the multiprocessing module (http://docs.python.org/library/multiprocessing.html), but so far I am finding it hard to digest.

1 Answer:

Answer 0 (score: 1):

If the significant part of the work is the API call, since it goes out to an external resource, then that is the only part you actually want to parallelize. The database calls are likely very fast. So you could try this:

  1. Fetch the adam_ids in one batch query
  2. Hand the IDs to a pool of worker processes to do the API calls
  3. Take the results and commit them back to the database

Here is a rough pseudocode example to show the logical flow:

    from multiprocessing import Pool
    
    def pull_down(cursor):
        # Step #1: get all the pending IDs in one query
        count = cursor.execute("""SELECT adam_id FROM title WHERE success=0
                          AND number_of_tries < 5 ORDER BY adam_id""")
        if count:
            adam_id_list = [row[0] for row in cursor.fetchall()]
    
            # Step #2: farm the API calls out to a pool of worker processes
            pool = Pool(4)
            results = pool.map(do_api_call, adam_id_list)
            pool.close()
            pool.join()
    
            # Step #3: commit the results back to the database
            update_db(results)
    
    def do_api_call(adam_id):
        # do api call
        success = call_api_with_id(adam_id)
        return (adam_id, success)
    
    def update_db(results):
        # loop over results and build batch queries for the succeeded
        # or failed items
    
        # (obviously this split up could be optimized)
        succeeded = [result[0] for result in results if result[1]]
        failed = [result[0] for result in results if not result[1]]
    
        submit_success(succeeded)
        submit_failed(failed)
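
The submit helpers are not shown above; a minimal sketch of what they might look like, assuming a DB-API cursor with %s placeholders (MySQLdb-style) and the same module-level cursor the rest of the pseudocode relies on:

    def submit_success(succeeded):
        # Batch-update every ID whose API call succeeded in one round trip.
        cursor.executemany("UPDATE title SET success=1 WHERE adam_id = %s",
                           [(adam_id,) for adam_id in succeeded])

    def submit_failed(failed):
        # Bump the retry counter for every ID whose API call failed.
        cursor.executemany("""UPDATE title SET number_of_tries = number_of_tries + 1
                              WHERE adam_id = %s""",
                           [(adam_id,) for adam_id in failed])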
    

If you tried to make the database calls parallel as well, it would just complicate the code, because then you would have to correctly give each process its own connection, when in reality it isn't the database that is slowing you down.
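
That said, if each worker really did need to talk to the database, the usual pattern is a Pool initializer that opens one connection per child process. A hypothetical sketch (the driver, connection parameters, and combined worker function are assumptions, not part of the answer above):

    from multiprocessing import Pool

    import MySQLdb  # assumed driver

    worker_conn = None  # set separately inside each worker process

    def init_worker():
        # Runs once in every child process, so each worker gets its own
        # connection instead of sharing the parent's.
        global worker_conn
        worker_conn = MySQLdb.connect(db="adam")  # hypothetical parameters

    def do_api_call_and_update(adam_id):
        # Worker does the API call AND its own database write.
        success = call_api_with_id(adam_id)
        cur = worker_conn.cursor()
        if success:
            cur.execute("UPDATE title SET success=1 WHERE adam_id = %s",
                        (adam_id,))
        else:
            cur.execute("""UPDATE title SET number_of_tries = number_of_tries + 1
                           WHERE adam_id = %s""", (adam_id,))
        worker_conn.commit()

    pool = Pool(4, initializer=init_worker)
    pool.map(do_api_call_and_update, adam_id_list)
    pool.close()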