I've written a script that works in combination with Selenium, using the get_proxies() method to make proxied requests with freshly generated proxies. I used the requests module to fetch the proxies so that I can reuse them in the script. What I'm trying to do is parse all the post links from the landing page and then fetch the title of each post from its target page.
My script below works inconsistently: when the get_random_proxy function happens to produce a working proxy, the script runs; otherwise it fails.
How can I make the script keep trying different proxies until it runs successfully?
This is what I've written so far:
import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from selenium import webdriver
from scrapy.crawler import CrawlerProcess
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC

def get_proxies():
    response = requests.get("https://www.sslproxies.org/")
    soup = BeautifulSoup(response.text, "lxml")
    proxies = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tr") if "yes" in item.text]
    return proxies

def get_random_proxy(proxy_vault):
    random.shuffle(proxy_vault)
    proxy_url = next(cycle(proxy_vault))
    return proxy_url

def start_script():
    proxy = get_proxies()
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument(f'--proxy-server={get_random_proxy(proxy)}')
    driver = webdriver.Chrome(options=chrome_options)
    return driver

class StackBotSpider(scrapy.Spider):
    name = "stackoverflow"

    start_urls = [
        'https://stackoverflow.com/questions/tagged/web-scraping'
    ]

    def __init__(self):
        self.driver = start_script()
        self.wait = WebDriverWait(self.driver, 10)

    def parse(self, response):
        self.driver.get(response.url)
        for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".summary .question-hyperlink"))):
            yield scrapy.Request(elem.get_attribute("href"), callback=self.parse_details)

    def parse_details(self, response):
        self.driver.get(response.url)
        for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "h1[itemprop='name'] > a"))):
            yield {"post_title": elem.text}

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(StackBotSpider)
c.start()
Answer 0 (score: 1)
When picking a random proxy, you can use the requests library to check whether the proxy actually works. Loop over the proxies:

1. Shuffle the list and pop a random proxy
2. Check it with requests; if the request succeeds, return the proxy, otherwise go back to step 1

Change your get_random_proxy to the following:
def get_random_proxy(proxy_vault):
    while proxy_vault:
        random.shuffle(proxy_vault)
        proxy_url = proxy_vault.pop()
        proxy_dict = {
            'http': proxy_url,
            'https': proxy_url
        }
        try:
            res = requests.get("http://example.com", proxies=proxy_dict, timeout=10)
            res.raise_for_status()
            return proxy_url
        except requests.exceptions.RequestException:
            continue
If get_random_proxy returns None, it means none of the proxies worked. In that case, omit the --proxy-server argument:
def start_script():
    proxy = get_proxies()
    chrome_options = webdriver.ChromeOptions()
    random_proxy = get_random_proxy(proxy)
    if random_proxy:  # only when we successfully find a working proxy
        chrome_options.add_argument(f'--proxy-server={random_proxy}')
    driver = webdriver.Chrome(options=chrome_options)
    return driver
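A proxy that passes the requests check can still die by the time Chrome uses it. If you want the script to keep trying different proxies until a page actually loads, one option is to rebuild the driver with the next verified proxy whenever a page load fails. This is a minimal sketch, not part of the answer above; it reuses the start_script helper defined earlier, and get_with_retry and max_attempts are hypothetical names:

from selenium.common.exceptions import TimeoutException, WebDriverException

def get_with_retry(url, max_attempts=5):
    # Keep rebuilding the driver with a freshly verified proxy
    # until the page loads or we run out of attempts.
    for _ in range(max_attempts):
        driver = start_script()
        driver.set_page_load_timeout(30)
        try:
            driver.get(url)
            return driver  # the caller is responsible for driver.quit()
        except (TimeoutException, WebDriverException):
            driver.quit()  # the proxy died after verification; try the next one
    raise RuntimeError("could not fetch {} after {} attempts".format(url, max_attempts))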
Answer 1 (score: 0)
Since you have tagged selenium: using Selenium alone, you can make proxied requests with the newly activated proxies listed within the Free Proxy List, using the following approach.

Note: this program will invoke the proxies from the proxy list one by one until a successful proxied connection is established, verified through the Proxy Check page at https://www.whatismyip.com/

Code block:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

options = webdriver.ChromeOptions()
options.add_argument('start-maximized')
options.add_argument('disable-infobars')
options.add_argument('--disable-extensions')
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://sslproxies.org/")
driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='table table-striped table-bordered dataTable']//th[contains(., 'IP Address')]"))))
ips = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='table table-striped table-bordered dataTable']//tbody//tr[@role='row']/td[position() = 1]")))]
ports = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='table table-striped table-bordered dataTable']//tbody//tr[@role='row']/td[position() = 2]")))]
driver.quit()
proxies = []
for i in range(0, len(ips)):
    proxies.append(ips[i] + ':' + ports[i])
print(proxies)
for i in range(0, len(proxies)):
    try:
        print("Proxy selected: {}".format(proxies[i]))
        options = webdriver.ChromeOptions()
        options.add_argument('--proxy-server={}'.format(proxies[i]))
        driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
        driver.get("https://www.whatismyip.com/proxy-check/?iref=home")
        # .text is needed here: the wait returns a WebElement, not a string
        if "Proxy Type" in WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "p.card-text"))).text:
            break
    except Exception:
        driver.quit()
        print("Proxy Invoked")
Console output:
['190.7.158.58:39871', '175.139.179.65:54980', '186.225.45.146:45672', '185.41.99.100:41258', '43.230.157.153:52986', '182.23.32.66:30898', '36.37.160.253:31450', '93.170.15.214:56305', '36.67.223.67:43628', '78.26.172.44:52490', '36.83.135.183:3128', '34.74.180.144:3128', '206.189.122.177:3128', '103.194.192.42:55546', '70.102.86.204:8080', '117.254.216.97:23500', '171.100.221.137:8080', '125.166.176.153:8080', '185.146.112.24:8080', '35.237.104.97:3128']
Proxy selected: 190.7.158.58:39871
Proxy selected: 175.139.179.65:54980
Proxy selected: 186.225.45.146:45672
Proxy selected: 185.41.99.100:41258
Answer 2 (score: 0)
You can try scrapy-rotating-proxies.

Here is another reference that may help you: https://www.scrapehero.com/how-to-rotate-proxies-and-ip-addresses-using-python-3/
Check this part:
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
}

ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # ...
]
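The linked ScrapeHero article applies the same rotation idea at the plain requests level instead of through Scrapy middleware. A minimal sketch of that approach follows; the proxy addresses are placeholders, and https://httpbin.org/ip simply echoes the caller's IP so you can see which proxy was used:

import requests
from itertools import cycle

# Placeholder proxies; in practice, fill this from a source such as sslproxies.org
proxy_pool = cycle(['proxy1.com:8000', 'proxy2.com:8031'])

url = 'https://httpbin.org/ip'
for _ in range(10):
    proxy = next(proxy_pool)
    try:
        response = requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)
        print(response.json())  # shows the proxy's IP if the rotation worked
    except requests.exceptions.RequestException:
        # Dead proxy: skip it and move on to the next one in the pool
        print("Skipping. Connection error with {}".format(proxy))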
Try this in your settings and you should get what you want. Hope it helps.