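My script currently checks several URLs for about 5 different types of keywords. Depending on whether each keyword type is found on the page, it outputs "ok" or "no".
I use set_page_load_timeout(30) to avoid waiting forever on a URL that never finishes loading.
Problem: some pages do not finish loading before the timeout (even when the timeout is "very" long), although I can see (running non-headless) that the page has loaded. The script could at least check the page for the keywords, but it doesn't: after the timeout it prints "Fail", and the site is not written to the final output at all, not even as "no".
So I don't want to simply fall into an except after 30 seconds; I want to stop loading the page after 30 seconds and work with whatever has been loaded so far.
My code: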
# coding=utf-8
import re
sites=[]
keywords_1=[]
keywords_2=[]
keywords_3=[]
keywords_4=[]
keywords_5=[]
import sys
from selenium import webdriver
import csv
import urllib.parse
from datetime import datetime
from datetime import date
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
def reader3(filename):
    with open(filename, 'r') as csvfile:
        # creating a csv reader object
        csvreader = csv.reader(csvfile)
        # extracting field names through first row
        # extracting each data row one by one
        for row in csvreader:
            sites.append(str(row[0]).lower())
try:
    reader3("data/script/filter_domain_OUTPUT.csv")
except Exception as e:
    print(e)
    sys.exit()
exc=[]
def reader3(filename):
    with open(filename, 'r') as csvfile:
        # creating a csv reader object
        csvreader = csv.reader(csvfile)
        # extracting field names through first row
        # extracting each data row one by one
        for row in csvreader:
            exc.append(str(row[0]).lower())
try:
    reader3("data/script/checking_EXCLUDE.csv")
except Exception as e:
    print(e)
    sys.exit()
def reader2(filename):
    with open(filename, 'r') as csvfile:
        # creating a csv reader object
        csvreader = csv.reader(csvfile)
        # extracting field names through first row
        # extracting each data row one by one
        for row in csvreader:
            keywords_1.append(str(row[0]).lower())
            keywords_2.append(str(row[1]).lower())
            keywords_3.append(str(row[2]).lower())
            keywords_4.append(str(row[3]).lower())
            keywords_5.append(str(row[4]).lower())
try:
    reader2("data/script/checking_KEYWORD.csv")
except Exception as e:
    print(e)
    sys.exit()
chrome_options = Options()
chrome_options.page_load_strategy = 'none'
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--lang=en')
chrome_options.add_argument('--disable-notifications')
#chrome_options.headless = True
chrome_options.add_argument('start-maximized')
chrome_options.add_argument('enable-automation')
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--disable-browser-side-navigation')
chrome_options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=chrome_options)
for site in sites:
    try:
        status_1 = "no"
        status_2 = "no"
        status_3 = "no"
        status_4 = "no"
        status_5 = "no"
        now = datetime.now()
        current_time = now.strftime("%H:%M:%S")
        today = date.today()
        print("[" + current_time + "] " + str(site))
        if 'http' in site:
            driver.get(site)
        else:
            driver.get("http://" + site)
        r=str(driver.page_source).lower()
        driver.set_page_load_timeout(30)
        for keyword_1 in keywords_1:
            if keyword_1 in r:
                status_1="ok"
                print("home -> " +str(keyword_1))
                break
        for keyword_2 in keywords_2:
            if keyword_2 in r:
                status_2="ok"
                print("home -> " +str(keyword_2))
                break
        for keyword_3 in keywords_3:
            if keyword_3 in r:
                status_3="ok"
                print("home -> " +str(keyword_3))
                break
        for keyword_4 in keywords_4:
            if keyword_4 in r:
                status_4="ok"
                print("home -> " +str(keyword_4))
                break
        for keyword_5 in keywords_5:
            if keyword_5 in r:
                status_5="ok"
                print("Home ->" +str(keyword_5))
                break
        with open('data/script/checking_OUTPUT.csv', mode='a') as employee_file:
            employee_writer = csv.writer(employee_file, delimiter=';', quotechar='"', quoting=csv.QUOTE_MINIMAL,lineterminator='\n')
            write=[site,status_1,status_2,status_3,status_4,status_5]
            employee_writer.writerow(write)
    except Exception as e:
        #driver.delete_all_cookies()
        print("Fail")
driver.quit()
Answer 0 (score: 1)
chromeOptions.setPageLoadStrategy(PageLoadStrategy.EAGER);
WebDriver driver = new ChromeDriver(chromeOptions);
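Use the eager page load strategy, which waits only until the initial HTML has loaded. You can also use none, but then make sure you have explicit/implicit waits for elements in case timing problems show up.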
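The Java lines above have a straightforward Python counterpart. The following is a minimal sketch, assuming the same set_capability call that this answer uses for "none" also accepts "eager"; the helper names apply_eager_strategy and make_eager_chrome are made up for illustration and are not part of Selenium.

```python
def apply_eager_strategy(options):
    """Ask the driver to return from get() once the DOM is ready."""
    # "eager" returns when the HTML has been parsed; images,
    # stylesheets and sub-frames may still be loading at that point.
    options.set_capability("pageLoadStrategy", "eager")
    return options


def make_eager_chrome():
    """Start Chrome with the eager strategy (hypothetical helper)."""
    # Imported lazily so apply_eager_strategy() can be used without a browser.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    return webdriver.Chrome(options=apply_eager_strategy(Options()))
```

Calling make_eager_chrome() in place of webdriver.Chrome(options=chrome_options) should be the only change the question's script needs to try this out.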
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
caps = DesiredCapabilities().CHROME
# caps["pageLoadStrategy"] = "normal" # Waits for full page load
caps["pageLoadStrategy"] = "none"
options = Options()
driver = webdriver.Chrome(desired_capabilities=caps, options=options)
url = 'https://www.gm-trucks.com/'
driver.get(url)
print(driver.title)
print("hi")
input()
Or:
options = Options()
options.set_capability("pageLoadStrategy", "none")
driver = webdriver.Chrome(options=options)
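With "none", driver.get() can return before the page has any useful content, so the source should not be read straight away. One way to handle the timing issue mentioned earlier is to poll document.readyState until the document is at least "interactive" or a deadline passes, and then read whatever is there. This is only a sketch: source_when_interactive is a hypothetical helper, and the 30 second deadline and 0.5 second poll interval are assumed values.

```python
import time


def source_when_interactive(driver, timeout=30.0, poll=0.5):
    """Return page_source once the DOM is usable, or after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = driver.execute_script("return document.readyState")
        if state in ("interactive", "complete"):
            break  # the HTML has been parsed; good enough for a keyword check
        time.sleep(poll)
    # Whether or not the deadline was hit, hand back what the browser has so far.
    return driver.page_source
```

In the question's loop this would replace the direct read of driver.page_source after driver.get(...), for example r = source_when_interactive(driver).lower().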
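The documentation has been updated for Selenium 4.0.0-alpha-7,
so either use one of the solutions above or upgrade to Selenium v4 to future-proof your code:
pip install selenium==4.0.0.a7
Bug report: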
https://github.com/SeleniumHQ/seleniumhq.github.io/issues/627
Answer 1 (score: 0)
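First of all, set_page_load_timeout() and page_load_strategy = 'none' ideally should not be used together.
set_page_load_timeout() sets how long to wait for a page load to complete before an error is raised.
You can find a detailed discussion in How to set the timeout of 'driver.get' for python selenium 3.8.0?
page_load_strategy = 'none' causes Selenium to return immediately after the initial page content has been fully received (the HTML content has been downloaded).
You can find a detailed discussion in How to set the timeout of 'driver.get' for python selenium 3.8.0?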
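If the timeout is kept and the page load strategy is left at its default, one commonly used pattern for "stop after 30 seconds and take what is there" is to set the timeout before calling get(), catch the timeout error, stop the load with window.stop(), and then read the source. The sketch below assumes the session still answers commands after a page load timeout, which may not hold for every driver version. get_with_deadline and its timeout_error parameter are hypothetical names introduced here, not Selenium API.

```python
def get_with_deadline(driver, url, seconds=30, timeout_error=None):
    """Load url, but give up waiting after `seconds` and keep the partial page.

    Returns (lowercased page source, whether the load timed out).
    """
    if timeout_error is None:
        # Imported lazily so the helper can be exercised with a stand-in driver.
        from selenium.common.exceptions import TimeoutException
        timeout_error = TimeoutException

    # The timeout must be set before get(), otherwise it does not apply to it.
    driver.set_page_load_timeout(seconds)
    timed_out = False
    try:
        driver.get(url)
    except timeout_error:
        timed_out = True
        # Ask the browser to abandon pending requests for this page.
        driver.execute_script("window.stop();")
    return driver.page_source.lower(), timed_out
```

In the question's loop, r, timed_out = get_with_deadline(driver, site) would replace the driver.get(...) and page_source lines, so a slow site still reaches the keyword checks and is written to the CSV instead of ending in "Fail".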