My script won't keep trying other proxies from the list until it gets a valid response

Date: 2019-06-05 18:52:53

Tags: python python-3.x selenium selenium-webdriver web-scraping

I've written a script in Python in combination with Selenium that rotates proxies to fetch different posts from a website. The script tries only once and then quits. What I want now is for my script to keep trying with different proxies until it gets a valid response, or until the list is exhausted.

I think my implementation is correct, but the script tries once and then quits. Because it doesn't raise any error, I can't get the script to keep trying even with a try/except clause.

import random
from selenium import webdriver
from random import choice

link = 'https://stackoverflow.com/questions/tagged/web-scraping'

proxies = ['103.110.37.244:36022', '180.254.218.229:8080', '110.74.197.207:50632', '1.20.101.95:49001']

def start_script():
    random.shuffle(proxies)
    proxy_url = choice(proxies)
    print("implementing:",proxy_url)
    options = webdriver.ChromeOptions()
    options.add_argument(f'--proxy-server={proxy_url}')
    driver = webdriver.Chrome(options=options)
    return driver

def get_links(url):
    driver = start_script()
    try:
        driver.get(url)
        items = [item.get_attribute("href") for item in driver.find_elements_by_css_selector(".summary .question-hyperlink")]
        for item in items:
            print(item)

    except Exception:
        driver.quit()
        get_links(url)

if __name__ == '__main__':
    get_links(link)
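The reason the except branch never fires is that an empty result is not an error: `driver.get` succeeds, `items` is simply empty, and the function returns. A minimal sketch of one way around this, with a hypothetical stand-in fetch function instead of Selenium: treat an empty result as a failure by raising an exception, so the retry path is taken for the next proxy.

```python
import random

def fetch_with_retries(proxies, fetch):
    """Try each proxy until one returns a non-empty result.

    `fetch` is any callable taking a proxy string and returning a list;
    an empty list is treated as a failure so the next proxy is tried.
    """
    pool = proxies[:]
    random.shuffle(pool)
    for proxy in pool:
        try:
            items = fetch(proxy)
            if not items:
                # an empty page is not an exception by itself, so raise one
                raise ValueError("empty response")
            return items
        except Exception:
            print(f"proxy {proxy} failed, trying next")
    return []

# stand-in fetcher for illustration: only one proxy "works"
def fake_fetch(proxy):
    return ["post-1", "post-2"] if proxy == "good:80" else []

print(fetch_with_retries(["bad:80", "good:80"], fake_fetch))
```

In the real script, `fetch` would wrap the `driver.get` + `find_elements` calls, with `driver.quit()` in a `finally` block so each failed browser instance is cleaned up.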

When a comment questioned whether the items variable was coming back empty, I decided to provide working code proving that the variable does in fact contain the desired list.

from selenium import webdriver

link = 'https://stackoverflow.com/questions/tagged/web-scraping'

def get_links(url):
    driver = webdriver.Chrome()
    driver.get(url)
    items = [item.get_attribute("href") for item in driver.find_elements_by_css_selector(".summary .question-hyperlink")]
    for item in items:
        print(item)

if __name__ == '__main__':
    get_links(link)

PS: The proxies in the list are placeholders; they are not working ones.

How can I make my script keep trying with different proxies to get a valid response until the list is exhausted?

1 Answer:

Answer 0 (score: 0)

I believe this will do what you need. I shuffle the proxies and then loop over them. If one yields a good result, the code processes the items, breaks out of the loop, and finishes. If anything goes wrong or no results are found, it tries the next proxy in the list.

import random
from selenium import webdriver

link = 'https://stackoverflow.com/questions/tagged/web-scraping'

proxies = ['103.110.37.244:36022', '180.254.218.229:8080', '110.74.197.207:50632', '1.20.101.95:49001']

def start_script(proxy_url):
    options = webdriver.ChromeOptions()
    options.add_argument(f'--proxy-server={proxy_url}')
    driver = webdriver.Chrome(options=options)
    return driver

def get_links(url):
    random.shuffle(proxies)
    for proxy in proxies:
        driver = start_script(proxy)
        try:
            driver.get(url)
            print('Url {} retrieved, get elements'.format(url))
            elements = driver.find_elements_by_css_selector(".summary .question-hyperlink")
            # find_elements_* always returns a list, so a truthiness check suffices
            print('Found {} elements'.format(len(elements)))
            items = [item.get_attribute("href") for item in elements]
            if items:
                for item in items:
                    print(item)
                break
        except Exception:
            print('Proxy {} failed, trying next'.format(proxy))
        finally:
            driver.quit()

if __name__ == '__main__':
    get_links(link)
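One optional refinement, not part of the answer above: since launching a full Chrome instance for a dead proxy is slow, a cheap TCP reachability check can filter the list first. This hypothetical helper only verifies that the host accepts a connection, which does not guarantee the proxy works for HTTP, but a failure lets you skip starting a browser for it.

```python
import socket

def proxy_reachable(proxy, timeout=2.0):
    """Cheap TCP check: can we open a socket to host:port at all?

    A successful connect does not guarantee the proxy speaks HTTP,
    but a failed one lets us skip launching a browser for it.
    """
    host, _, port = proxy.partition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        # OSError covers refused/timed-out/unresolvable; ValueError covers a bad port
        return False

# e.g. live = [p for p in proxies if proxy_reachable(p)]
```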