I have written a script in Python in combination with Selenium that fetches different posts from a website using rotating proxies. The script tries only once and then quits. What I want now is for the script to keep trying with different proxies until it gets a valid response or the list is exhausted.
I thought my implementation was correct, but the script tries once and then exits. Because it does not raise any error, I cannot keep the script trying even with a try/except clause in place.
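A minimal, Selenium-free sketch of why the try/except never fires here: iterating over an empty result list is not an error, so an except-based retry only triggers when the request itself fails, never when it quietly returns nothing.

```python
# An empty list produces no exception, so the except branch is never taken.
items = []
try:
    for item in items:
        print(item)
    result = "no exception"
except Exception:
    result = "exception"

print(result)  # -> no exception
```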
import random
from selenium import webdriver
from random import choice

link = 'https://stackoverflow.com/questions/tagged/web-scraping'
proxies = ['103.110.37.244:36022', '180.254.218.229:8080', '110.74.197.207:50632', '1.20.101.95:49001']

def start_script():
    random.shuffle(proxies)
    proxy_url = choice(proxies)
    print("implementing:", proxy_url)
    options = webdriver.ChromeOptions()
    options.add_argument(f'--proxy-server={proxy_url}')
    driver = webdriver.Chrome(options=options)
    return driver

def get_links(url):
    driver = start_script()
    try:
        driver.get(url)
        items = [item.get_attribute("href") for item in driver.find_elements_by_css_selector(".summary .question-hyperlink")]
        for item in items:
            print(item)
    except Exception:
        driver.quit()
        get_links(url)

if __name__ == '__main__':
    get_links(link)
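The retry behavior I am after can be sketched without Selenium. Here `fetch` is a hypothetical stand-in for the browser call, and the proxy names are placeholders; the point is the loop: try each proxy in turn, return on the first success, and fall through when the list runs out.

```python
import random

proxies = ['p1', 'p2', 'p3', 'p4']

def fetch(proxy):
    # Hypothetical stand-in for the real Selenium fetch;
    # in this sketch only 'p3' "works".
    if proxy != 'p3':
        raise ConnectionError(f"proxy {proxy} failed")
    return ['link-a', 'link-b']

def get_links_with_retry():
    order = proxies[:]          # copy so the module-level list is untouched
    random.shuffle(order)
    for proxy in order:
        try:
            return fetch(proxy)  # first success wins
        except ConnectionError as exc:
            print(f"{exc}, trying next proxy")
    return []                   # list exhausted, no proxy worked

print(get_links_with_retry())
```

Because the one working proxy is always somewhere in the shuffled list, this sketch always ends up returning the two links, whatever order the failures come in.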
When it was questioned in the comments whether the items variable might be coming back empty, I decided to provide working code to prove that the variable does in fact contain the expected list.
from selenium import webdriver

link = 'https://stackoverflow.com/questions/tagged/web-scraping'

def get_links(url):
    driver = webdriver.Chrome()
    driver.get(url)
    items = [item.get_attribute("href") for item in driver.find_elements_by_css_selector(".summary .question-hyperlink")]
    for item in items:
        print(item)

if __name__ == '__main__':
    get_links(link)
PS: The proxies in the list are placeholders; they are not working ones.
How can I make my script keep trying with different proxies to get a valid response until the list is exhausted?
Answer 0 (score: 0)
I believe this will do what you want. I shuffle the proxies and then iterate over them. If a proxy yields a good result, the items are processed, the loop is broken, and we are done. If anything goes wrong or no results are found, it tries the next proxy in the list.
import random
from selenium import webdriver

link = 'https://stackoverflow.com/questions/tagged/web-scraping'
proxies = ['103.110.37.244:36022', '180.254.218.229:8080', '110.74.197.207:50632', '1.20.101.95:49001']

def start_script(proxy_url):
    options = webdriver.ChromeOptions()
    options.add_argument(f'--proxy-server={proxy_url}')
    driver = webdriver.Chrome(options=options)
    return driver

def get_links(url):
    random.shuffle(proxies)
    for proxy in proxies:
        driver = start_script(proxy)
        try:
            driver.get(url)
            print('Url {} retrieved, get elements'.format(url))
            elements = driver.find_elements_by_css_selector(".summary .question-hyperlink")
            print('Selected elements, check for None')
            if elements is not None:
                print('Found {} elements'.format(len(elements)))
                items = [item.get_attribute("href") for item in elements]
                if items is not None and len(items) > 0:
                    for item in items:
                        print(item)
                    break
        except Exception:
            print('Proxy {} failed, trying next'.format(proxy))
        finally:
            driver.quit()

if __name__ == '__main__':
    get_links(link)
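The loop above breaks on the first proxy that works; if you also want to know when the whole list was exhausted, Python's for/else fits naturally. This is a generic, Selenium-free sketch (`attempt` is a hypothetical callable standing in for the fetch), not part of the answer's code:

```python
def try_all(options, attempt):
    """Try attempt(opt) for each option; return the first result, or None."""
    for opt in options:
        try:
            result = attempt(opt)
            break               # first success ends the search
        except Exception:
            continue            # this option failed, try the next one
    else:
        # The else clause runs only if the loop was never broken,
        # i.e. every option failed.
        result = None
    return result

# Options 1 raises (1/0), option 2 succeeds.
print(try_all([1, 2, 3], lambda x: x if x > 1 else 1 / 0))  # -> 2
```

This keeps the "all proxies failed" case explicit instead of having the function silently fall off the end of the loop.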