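I've written a Python script that scrapes URLs through proxies. The script uses shuffle() to pick a proxy at random, and it works to some extent. The problem is that when a request fails with one proxy, the script does not retry with another; because of the loop, it simply moves on to the next URL. How can I fix the script so that it tries every proxy in the list, if necessary, until all of the URLs have been fetched?
Here is my attempt: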
import requests
from random import shuffle

url = "https://stackoverflow.com/questions?page={}&sort=newest"

def get_random_proxies():
    proxies = ['35.199.8.64:80', '50.224.173.189:8080', '173.164.26.117:3128']
    shuffle(proxies)
    return iter(proxies)

for link in [url.format(page) for page in range(1, 6)]:
    # A fresh shuffled iterator is built for every URL and only its first item is used
    proxy = next(get_random_proxies())
    try:
        response = requests.get(link, proxies={"http": "http://{}".format(proxy), "https": "http://{}".format(proxy)})
        print(f'{response.url}\n{proxy}\n')
    except Exception:
        print("something went wrong!!" + "\n")
        proxy = next(get_random_proxies())
The output I get:
https://stackoverflow.com/questions?page=1&sort=newest
35.199.8.64:80
https://stackoverflow.com/questions?page=2&sort=newest
50.224.173.189:8080
something went wrong!!
https://stackoverflow.com/questions?page=4&sort=newest
50.224.173.189:8080
something went wrong!!
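As you can see, two of the URLs, 'page=3&sort=newest' and 'page=5&sort=newest', got no response even though two of my proxies were still working.
P.S. These are free proxies, which is why I posted them deliberately.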
Answer 0: (score: 2)
How about this:
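import requests
from random import shuffle

url = "https://stackoverflow.com/questions?page={}&sort=newest"

def get_random_proxies():
    proxies = ['35.199.8.64:80', '50.224.173.189:8080', '173.164.26.117:3128']
    shuffle(proxies)
    return proxies

for link in [url.format(page) for page in range(1, 6)]:
    # Walk the whole shuffled list for this URL instead of taking only the first proxy
    for proxy in get_random_proxies():
        address = "http://{}".format(proxy)  # explicit scheme, as in the question
        try:
            response = requests.get(link, proxies={"http": address, "https": address})
            print(f'{response.url}\n{proxy}\n')
            break  # success, stop trying proxies
        except Exception:
            print("something went wrong!!" + "\n")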
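I'm not sure what the plan was with return iter(...) and next(result), but the more conventional approach is to just return the list and then iterate over as much of it as you need. You've already built the list, so returning it costs nothing extra.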
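If it helps, here is a minimal sketch of the same "return the list and loop over it" idea wrapped in a reusable helper. The names `fetch_with_fallback` and `fetch` are my own, not from the question, and I'm assuming the actual request is passed in as a callable so the retry logic can be tried out without any network access.

```python
from random import sample


def fetch_with_fallback(link, proxies, fetch):
    """Try every proxy in random order until one works.

    `fetch` is a hypothetical callable taking (link, proxy) that returns a
    result or raises on failure. Returns (result, proxy) for the first
    success, or (None, None) if every proxy failed.
    """
    # sample() returns a shuffled copy, so the caller's list is left untouched
    for proxy in sample(proxies, len(proxies)):
        try:
            return fetch(link, proxy), proxy
        except Exception:
            continue  # this proxy failed, move on to the next one
    return None, None
```

To use it with requests, `fetch` could be a small wrapper that builds `address = f"http://{proxy}"` and calls `requests.get(link, proxies={"http": address, "https": address}, timeout=10)`. Setting a timeout is worthwhile here, because without one a dead free proxy can make the call hang instead of raising.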