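I'm building a spider that tries to use Selenium and proxies. The main goal is to make the spider as robust as possible so that it doesn't get caught scraping. I know Scrapy has the `scrapy-rotating-proxies` module, but I haven't been able to verify whether Scrapy can check if chromedriver's request for a page succeeded and, if it failed because the scraper was caught, switch to another proxy.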
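Second, I'm not sure how my computer handles proxies. For example, when I set a proxy value somewhere, does that value apply to everything that makes requests from my computer? That is, will Scrapy and the webdriver share the same proxy as long as one of them sets it? In particular, if Scrapy has a proxy set, will any Selenium webdriver instantiated inside the class definition inherit that proxy?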
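To make the second question concrete, here is a minimal sketch of how I currently understand each tool is handed a proxy: Scrapy's built-in `HttpProxyMiddleware` reads `request.meta['proxy']`, while Chrome takes a `--proxy-server` command-line switch. The helper names (`to_proxy_url`, `scrapy_meta_for`, `chrome_with_proxy`) are made up for illustration. If this is right, the two are configured separately and nothing is shared automatically, but that is exactly the part I'd like confirmed.

```python
def to_proxy_url(proxy, scheme="http"):
    """Turn "host:port" into "http://host:port"; leave full URLs untouched."""
    return proxy if "://" in proxy else f"{scheme}://{proxy}"


def scrapy_meta_for(proxy):
    """Per-request meta dict; Scrapy's HttpProxyMiddleware reads the "proxy" key."""
    return {"proxy": to_proxy_url(proxy)}


def chrome_with_proxy(proxy):
    """Start a Chrome webdriver whose traffic goes through `proxy`."""
    # Imported lazily so the helpers above work without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # Chrome is a separate process, so it needs its own proxy switch.
    options.add_argument(f"--proxy-server={to_proxy_url(proxy)}")
    return webdriver.Chrome(options=options)
```

Inside a spider this would be used roughly as `yield scrapy.Request(url, meta=scrapy_meta_for(proxy))` for Scrapy's own requests and `self.driver = chrome_with_proxy(proxy)` for the browser.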
I'm inexperienced with these tools and would really appreciate any help!
I've been trying to find a way to test and check the proxy value Selenium is using, and Scrapy's as well, so I can compare them.
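The kind of check I have in mind is roughly the following: ask an IP-echo service which address it sees from each client and compare the answers. This is only a sketch under assumptions: `parse_origin`, `exit_ip_plain` and `exit_ip_of_driver` are hypothetical helper names, it uses the standard library's `urllib` for the plain client, and it assumes Chrome renders the JSON response from https://httpbin.org/ip inside a `<pre>` element.

```python
import json
import urllib.request

IP_ECHO_URL = "https://httpbin.org/ip"  # responds with {"origin": "<caller ip>"}


def parse_origin(body):
    """Pull the caller's IP out of an httpbin-style JSON body."""
    return json.loads(body)["origin"].strip()


def exit_ip_plain(proxy=None, timeout=10):
    """Exit IP for a plain HTTP client, optionally through proxy "host:port"."""
    handlers = []
    if proxy:
        proxy_url = f"http://{proxy}"
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    opener = urllib.request.build_opener(*handlers)
    with opener.open(IP_ECHO_URL, timeout=timeout) as resp:
        return parse_origin(resp.read().decode("utf-8"))


def exit_ip_of_driver(driver):
    """Exit IP for an already-configured Selenium driver (e.g. the spider's)."""
    # Imported lazily so the other helpers work without Selenium installed.
    from selenium.webdriver.common.by import By

    driver.get(IP_ECHO_URL)
    # Assumption: Chrome shows a raw JSON response inside a <pre> element.
    return parse_origin(driver.find_element(By.TAG_NAME, "pre").text)
```

Comparing `exit_ip_plain()` (no proxy), `exit_ip_plain(proxy)` and `exit_ip_of_driver(self.driver)` should show whether the webdriver is really going through the proxy. For Scrapy's own requests, I believe `response.meta.get("proxy")` inside a callback shows which proxy that request carried.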
# Scrapes free proxies, keeps the ones that respond, and stores them as the
# scrapy-rotating-proxies list in the project settings.
from itertools import cycle

import requests
from lxml.html import fromstring

# NOTE: `settings` is assumed to be the Scrapy project's settings module,
# imported elsewhere in the project.


def get_proxies():
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)
    proxies = set()
    for i in parser.xpath('//tbody/tr')[:10]:
        # Keep only rows whose 7th column (HTTPS) says "yes"
        if i.xpath('.//td[7][contains(text(),"yes")]'):
            # Grab the IP and the corresponding port
            proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)

    proxy_pool = cycle(proxies)
    url = 'https://httpbin.org/ip'
    new_proxy_list = []
    for i in range(1, 30):
        # Get a proxy from the pool
        proxy = next(proxy_pool)
        try:
            # A timeout keeps dead proxies from hanging the loop
            response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
            # Keep the proxy if the request went through (skip duplicates,
            # since the pool cycles over the same few proxies)
            if proxy not in new_proxy_list:
                new_proxy_list.append(proxy)
        except requests.exceptions.RequestException:
            # Most free proxies will often get connection errors. A real crawler
            # would retry the request with another proxy; here we just skip it.
            print("Skipping. Connection error")

    # Add the working proxies to the settings proxy list
    settings.ROTATING_PROXY_LIST = new_proxy_list
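
For the first question (switching proxy when chromedriver fails or gets caught), what I'm picturing is something like the sketch below. It is not how `scrapy-rotating-proxies` works internally; it is a standalone retry loop I wrote to illustrate the idea. All names here (`fetch_with_rotation`, `fetch_with_chrome`, `looks_blocked`, `AllProxiesFailed`) are hypothetical, and the "blocked" markers are just guesses that would need tuning per site.

```python
from itertools import cycle


class AllProxiesFailed(Exception):
    """Raised when no proxy produced an unblocked page."""


def looks_blocked(page_source):
    """Crude ban check: look for typical block-page wording (assumed markers)."""
    lowered = page_source.lower()
    return any(marker in lowered for marker in ("captcha", "access denied"))


def fetch_with_chrome(url, proxy):
    """Load `url` in a fresh Chrome that uses `proxy` ("host:port")."""
    # Imported lazily so the rotation logic can be tested without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument(f"--proxy-server=http://{proxy}")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def fetch_with_rotation(url, proxies, fetch, is_blocked=looks_blocked, max_attempts=5):
    """Try `url` through successive proxies until one returns an unblocked page."""
    if not proxies:
        raise ValueError("proxy list is empty")
    pool = cycle(proxies)
    last_error = None
    for _ in range(max_attempts):
        proxy = next(pool)
        try:
            page = fetch(url, proxy)
        except Exception as exc:  # broad on purpose: `fetch` is pluggable
            last_error = exc
            continue
        if not is_blocked(page):
            return proxy, page
    raise AllProxiesFailed(
        f"no working proxy for {url} after {max_attempts} attempts"
    ) from last_error
```

With the list built above, the call would look like `proxy, html = fetch_with_rotation(url, new_proxy_list, fetch_with_chrome)`. Passing `fetch` in as a parameter keeps the rotation logic separate from Selenium, so it can be exercised with a fake fetcher.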