Python: processing a large URL list with multiprocessing while looping over a proxy list

Asked: 2017-08-13 06:13:12

Tags: python loops proxy multiprocessing python-requests

Honestly, I'm not even sure how to title this question. I'm trying to iterate over a large list of URLs, but process only 20 at a time (20 because that's how many proxies I have). I also need to cycle through the proxy list as I work through the URLs: it would start with the first URL and the first proxy, and once it reaches the 21st URL it would use the first proxy again. Below is my poor attempt; if anyone could point me in the right direction I'd appreciate it.

import pymysql.cursors
from multiprocessing import Pool
from fake_useragent import UserAgent

def worker(args):
    var_a, id, name, content, proxy, headers, connection = args
    print (var_a)
    print (id)
    print (name)
    print (content)
    print (proxy)
    print (headers)
    print (connection)
    print ('---------------------------')

if __name__ == '__main__':
    connection = pymysql.connect(
        host = 'host',
        user = 'user',
        password = 'password',
        db = 'db',
        charset='utf8mb4',
        cursorclass=pymysql.cursors.DictCursor
    )

    ua = UserAgent()
    user_agent = ua.chrome
    headers = {'User-Agent' : user_agent}

    proxies = [
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx'
    ]

    with connection.cursor() as cursor:
        sql = "SELECT id,name,content FROM table"
        cursor.execute(sql)
        urls = cursor.fetchall()

    var_a = 'static'

    data = ((var_a, url['id'], url['name'], url['content'], proxies[i % len(proxies)], headers, connection) for i, url in enumerate(urls))
    proc_num = 20
    p = Pool(processes=proc_num)
    results = p.imap(worker, data)
    p.close() 
    p.join()

2 Answers:

Answer 0 (score: 1)

You can use a list to store the new processes. When you reach a certain number of items, call join() for each process in the list. This should give you some control over the number of active processes.

if __name__ == '__main__':
    proc_num = 20
    proc_list = []
    for i, url in enumerate(urls):
        proxy = proxies[i % len(proxies)]
        p = Process(target=worker, args=(url, proxy))
        p.start()
        proc_list.append(p)
        if i % proc_num == 0 or i == len(urls)-1:
            for proc in proc_list:
                proc.join()

If you want a constant number of active processes, you can try the Pool module. Just modify the worker definition to receive a tuple.

if __name__ == '__main__':
    data = ((url, proxies[i % len(proxies)]) for i, url in enumerate(urls))
    proc_num = 20
    p = Pool(processes=proc_num)
    results = p.imap(worker, data)
    p.close()
    p.join()

Just to clarify things, the worker function should receive a tuple and then unpack it.
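Since the modified worker isn't shown, here is a minimal sketch of a tuple-unpacking worker. The URL and proxy values are placeholders, and the echoed return string is just for illustration; a real worker would request the URL through the proxy (e.g. with `requests`):

```python
from multiprocessing import Pool

def worker(data):
    # Unpack the (url, proxy) tuple produced by the generator expression
    url, proxy = data
    # A real worker would fetch `url` through `proxy` here;
    # this sketch only echoes the pairing it received.
    return f"{url} via {proxy}"

if __name__ == '__main__':
    urls = ['http://a', 'http://b', 'http://c']
    proxies = ['p1', 'p2']
    data = ((u, proxies[i % len(proxies)]) for i, u in enumerate(urls))
    with Pool(processes=2) as pool:
        print(list(pool.imap(worker, data)))
```

Note that `imap` preserves input order, so each result lines up with its URL.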

Answer 1 (score: 0)

Try the following code:

for i in range(len(urls)):
    url = urls[i] # Current URL
    proxy = proxies[i % len(proxies)] # Current proxy
    # ...
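The modulo indexing above can also be written with `itertools.cycle`, which keeps repeating the proxy list for as long as there are URLs. A sketch with placeholder values:

```python
from itertools import cycle

urls = ['http://example.com/1', 'http://example.com/2', 'http://example.com/3']
proxies = ['10.0.0.1:8080', '10.0.0.2:8080']

# zip() stops at the shorter iterable, so cycle() keeps yielding
# proxies until the URL list is exhausted.
pairs = list(zip(urls, cycle(proxies)))
print(pairs)
# The third URL wraps around to the first proxy again.
```

The resulting `(url, proxy)` tuples can be fed directly to `Pool.imap` as in the first answer.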