Unable to make my script keep trying with different proxies until it runs successfully

Asked: 2019-05-13 21:25:30

Tags: python python-3.x selenium web-scraping scrapy

I've written a script in combination with Selenium that makes proxied requests using newly generated proxies from the get_proxies() method. I used the requests module to fetch the proxies so that I can reuse them in the script. What I'm trying to do is parse all the post links from the landing page and then fetch the title of each post from its target page.

My script below works inconsistently: it only succeeds when the get_random_proxy function happens to produce a working proxy; otherwise it fails.

How can I make my script keep trying with different proxies until it runs successfully?

This is what I've written so far:

import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from selenium import webdriver
from scrapy.crawler import CrawlerProcess
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC

def get_proxies():
    # Scrape https://www.sslproxies.org/ and return a list of "ip:port" strings
    # for the rows marked "yes" (HTTPS-capable proxies)
    response = requests.get("https://www.sslproxies.org/")
    soup = BeautifulSoup(response.text, "lxml")
    proxies = [
        ':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text])
        for item in soup.select("table.table tr") if "yes" in item.text
    ]
    return proxies

def get_random_proxy(proxy_vault):
    # Shuffle and take one proxy; note that nothing verifies the proxy actually works
    random.shuffle(proxy_vault)
    proxy_url = next(cycle(proxy_vault))
    return proxy_url

def start_script():
    proxy = get_proxies()
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument(f'--proxy-server={get_random_proxy(proxy)}')
    driver = webdriver.Chrome(options=chrome_options)
    return driver

class StackBotSpider(scrapy.Spider):
    name = "stackoverflow"

    start_urls = [
        'https://stackoverflow.com/questions/tagged/web-scraping'
    ]

    def __init__(self):
        self.driver = start_script()
        self.wait = WebDriverWait(self.driver, 10)

    def parse(self,response):
        self.driver.get(response.url)
        for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".summary .question-hyperlink"))):
            yield scrapy.Request(elem.get_attribute("href"),callback=self.parse_details)

    def parse_details(self,response):
        self.driver.get(response.url)
        for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "h1[itemprop='name'] > a"))):
            yield {"post_title":elem.text}

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',   
})
c.crawl(StackBotSpider)
c.start()

3 Answers:

Answer 0 (score: 1)

When picking a random proxy, you can use the requests library to check whether the proxy actually works. Loop over the proxies:

  1. Shuffle the list and pop a random proxy
  2. Check it with requests; if the request succeeds, return the proxy, otherwise go back to step 1

Change your get_random_proxy to something like the following:

def get_random_proxy(proxy_vault):
    # Draw proxies until one passes a live check; implicitly returns None if the vault runs dry
    while proxy_vault:
        random.shuffle(proxy_vault)
        proxy_url = proxy_vault.pop()
        proxy_dict = {
            'http': proxy_url,
            'https': proxy_url
        }
        try:
            res = requests.get("http://example.com", proxies=proxy_dict, timeout=10)
            res.raise_for_status()
            return proxy_url
        except requests.RequestException:
            continue

If get_random_proxy returns None, it means that none of the proxies worked. In that case, skip the --proxy-server argument:

def start_script():
    proxy = get_proxies()
    chrome_options = webdriver.ChromeOptions()
    random_proxy = get_random_proxy(proxy)
    if random_proxy: # only when we successfully find a working proxy
        chrome_options.add_argument(f'--proxy-server={random_proxy}')
    driver = webdriver.Chrome(options=chrome_options)
    return driver
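
If the proxy that passed the initial check dies later, the driver itself will start failing mid-crawl. A minimal sketch of one way to guard against that, assuming an illustrative retry wrapper (start_script_with_retry and the http://example.com probe are not part of the original answer):

from selenium.common.exceptions import WebDriverException, TimeoutException

def start_script_with_retry(max_attempts=5):
    # Illustrative helper: rebuild the driver with a fresh proxy until a probe page loads
    for _ in range(max_attempts):
        driver = start_script()
        try:
            driver.set_page_load_timeout(15)
            driver.get("http://example.com")  # cheap reachability probe through the chosen proxy
            return driver
        except (WebDriverException, TimeoutException):
            driver.quit()  # this proxy (or the browser) failed; retry with another
    raise RuntimeError("no working proxy found after {} attempts".format(max_attempts))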

Answer 1 (score: 0)

Since you have tagged selenium: using Selenium alone, you can make proxied requests with the newly activated proxies listed in the Free Proxy List using the approach below.

Note: This program will invoke the proxies from the list one by one until a successful proxied connection is established and verified through the Proxy Check page at https://www.whatismyip.com/

  • Code block:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException
    
    options = webdriver.ChromeOptions()
    options.add_argument('start-maximized')
    options.add_argument('disable-infobars')
    options.add_argument('--disable-extensions')
    driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
    driver.get("https://sslproxies.org/")
    driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='table table-striped table-bordered dataTable']//th[contains(., 'IP Address')]"))))
    ips = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='table table-striped table-bordered dataTable']//tbody//tr[@role='row']/td[position() = 1]")))]
    ports = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='table table-striped table-bordered dataTable']//tbody//tr[@role='row']/td[position() = 2]")))]
    driver.quit()
    proxies = []
    for i in range(0, len(ips)):
        proxies.append(ips[i]+':'+ports[i])
    print(proxies)
    for i in range(0, len(proxies)):
        try:
            print("Proxy selected: {}".format(proxies[i]))
            options = webdriver.ChromeOptions()
            options.add_argument('--proxy-server={}'.format(proxies[i]))
            driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
            driver.get("https://www.whatismyip.com/proxy-check/?iref=home")
            if "Proxy Type" in WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "p.card-text"))).text:
                break
            driver.quit()  # check failed without raising; close this driver before trying the next proxy
        except Exception:
            driver.quit()
    print("Proxy Invoked")
    
  • Console output:

    ['190.7.158.58:39871', '175.139.179.65:54980', '186.225.45.146:45672', '185.41.99.100:41258', '43.230.157.153:52986', '182.23.32.66:30898', '36.37.160.253:31450', '93.170.15.214:56305', '36.67.223.67:43628', '78.26.172.44:52490', '36.83.135.183:3128', '34.74.180.144:3128', '206.189.122.177:3128', '103.194.192.42:55546', '70.102.86.204:8080', '117.254.216.97:23500', '171.100.221.137:8080', '125.166.176.153:8080', '185.146.112.24:8080', '35.237.104.97:3128']
    
    Proxy selected: 190.7.158.58:39871
    Proxy selected: 175.139.179.65:54980
    Proxy selected: 186.225.45.146:45672
    Proxy selected: 185.41.99.100:41258

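A lighter-weight variant of the check (a sketch, not part of the original answer; https://httpbin.org/ip is an assumed echo endpoint) is to load a JSON service that reports the caller's IP and confirm it matches the selected proxy:

    # inside the for-loop above, replacing the whatismyip.com check
    proxy_ip = proxies[i].split(':')[0]
    driver.get("https://httpbin.org/ip")
    if proxy_ip in driver.page_source:  # reported origin matches the proxy, so the tunnel works
        break
    driver.quit()
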
Answer 2 (score: 0)

You can try the scrapy-rotating-proxies package, which provides the RotatingProxyMiddleware used below (install it with pip install scrapy-rotating-proxies).

Here is another reference that may help you: https://www.scrapehero.com/how-to-rotate-proxies-and-ip-addresses-using-python-3/

Check this part:

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
}

ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # ...
]
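
Per the scrapy-rotating-proxies documentation, the proxy list can also be read from a file instead of being declared inline, which helps when the list is long or refreshed by another script (a sketch; proxies.txt is an assumed file name):

ROTATING_PROXY_LIST_PATH = 'proxies.txt'  # one proxy per line, e.g. host:port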

Try this in your settings and you should get what you want. Hope it helps.