Honestly, I'm not even sure what to title this question. I'm trying to loop through a large list of URLs, but only work on 20 URLs at a time (20 because that's how many proxies I have). I also need to keep cycling through the proxy list as I work through the URLs, so the run starts with the first URL and the first proxy, and once it reaches the 21st URL it comes back around to the first proxy. Below is my poor attempt, followed by a standalone sketch of the rotation I mean; if anyone can point me in the right direction, I would greatly appreciate it.
import pymysql.cursors
from multiprocessing import Pool
from fake_useragent import UserAgent

def worker(args):
    var_a, id, name, content, proxy, headers, connection = args
    print(var_a)
    print(id)
    print(name)
    print(content)
    print(proxy)
    print(headers)
    print(connection)
    print('---------------------------')

if __name__ == '__main__':
    connection = pymysql.connect(
        host='host',
        user='user',
        password='password',
        db='db',
        charset='utf8mb4',
        cursorclass=pymysql.cursors.DictCursor
    )
    ua = UserAgent()
    user_agent = ua.chrome
    headers = {'User-Agent': user_agent}
    proxies = [
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx',
        'xxx.xxx.xxx.xxx:xxxxx'
    ]
    with connection.cursor() as cursor:
        sql = "SELECT id,name,content FROM table"
        cursor.execute(sql)
        urls = cursor.fetchall()
    var_a = 'static'
    data = ((var_a, url['id'], url['name'], url['content'], proxies[i % len(proxies)], headers, connection) for i, url in enumerate(urls))
    proc_num = 20
    p = Pool(processes=proc_num)
    results = p.imap(worker, data)
    p.close()
    p.join()
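To be explicit about the rotation described above, here is the standalone sketch (the URL and proxy values are placeholders, independent of my database code):

from itertools import cycle

urls = ['url1', 'url2', 'url3']  # placeholder URLs
proxies = ['proxy1', 'proxy2']   # placeholder proxies

# zip() stops at the shorter iterable, so cycle() restarts the proxy
# list whenever it runs out: 'url3' is paired with 'proxy1' again.
for url, proxy in zip(urls, cycle(proxies)):
    print(url, proxy)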
Answer 0 (score: 1)
You can use a list to store the new processes. When you reach a certain number of items, call join() for every process in the list. This should give you some control over the number of active processes.
from multiprocessing import Process

if __name__ == '__main__':
    proc_num = 20
    proc_list = []
    for i, url in enumerate(urls):
        proxy = proxies[i % len(proxies)]
        p = Process(target=worker, args=(url, proxy))
        p.start()
        proc_list.append(p)
        # join the batch once proc_num processes have been started,
        # or when the last URL has been dispatched
        if (i + 1) % proc_num == 0 or i == len(urls) - 1:
            for proc in proc_list:
                proc.join()
            proc_list = []  # start a fresh batch
If you want a constant number of active processes, you can try the Pool module. Just modify the worker definition so that it receives a tuple.
if __name__ == '__main__':
    data = ((url, proxies[i % len(proxies)]) for i, url in enumerate(urls))
    proc_num = 20
    p = Pool(processes=proc_num)
    results = p.imap(worker, data)
    p.close()
    p.join()
To clarify things, the worker function should receive a single tuple and then unpack it.
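For example, a minimal sketch of such a worker, assuming the (url, proxy) tuples built above (the body is just a placeholder):

def worker(args):
    url, proxy = args  # unpack the (url, proxy) tuple passed by imap
    print(url, proxy)  # the actual request through the proxy goes here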
Answer 1 (score: 0)
Please try the following code:
for i in range(len(urls)):
    url = urls[i]                      # current URL
    proxy = proxies[i % len(proxies)]  # current proxy
    # ...
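The modulo index wraps around by itself; a small standalone sketch with placeholder values shows the resulting rotation:

proxies = ['proxy1', 'proxy2', 'proxy3']  # pretend there are only 3 proxies
for i in range(7):                        # 7 stand-in URLs
    print(i, proxies[i % len(proxies)])   # proxy indices run 0,1,2,0,1,2,0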