Handling a slow-loading webpage: removing the hardcoded delay from my script

Time: 2018-05-03 19:20:04

Tags: python python-3.x selenium web-scraping selenium-chromedriver

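I've written a Python script using Selenium to parse some names from a webpage that lazy-loads its content: more results appear each time the page is scrolled to the bottom. The script runs without errors. The one problem I can't solve is getting rid of the hardcoded delay. I don't know how to replace the hardcoded delay with an explicit wait while keeping the script's logic intact, which would make it more efficient. Thanks in advance for any help.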

Webpage link

This is what I've tried so far (it works):

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("find_the_link_above")

last_len = len(driver.find_elements_by_class_name("listing__name--link"))
new_len = last_len

while True:
    last_len = new_len
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    time.sleep(3)  # I wish to drop this hardcoded delay and use an explicit wait instead

    items = driver.find_elements_by_class_name("listing__name--link")
    new_len = len(items)
    if last_len == new_len:
        break

for item in items:
    print(item.text)
driver.quit()

2 Answers:

Answer 0: (score: 1)

This is how to implement an ExplicitWait:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get("https://www.yellowpages.ca/search/si/1/coffee/all%20states")

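# Collect the initial items so the final loop still works if the first wait times out
items = driver.find_elements_by_class_name("listing__name--link")
last_len = len(items)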

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        wait(driver, 3).until(lambda driver: len(driver.find_elements_by_class_name("listing__name--link")) > last_len)
        items = driver.find_elements_by_class_name("listing__name--link")
        last_len = len(items)
    except TimeoutException:
        break

for item in items:
    print(item.text)
driver.quit()

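This scrolls down and waits up to 3 seconds (increase the timeout if needed) for the number of elements to increase on each pass; if the count stays the same, the wait times out and the while loop breaks.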
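
The scroll-then-wait loop described above can also be packaged as a reusable condition object. The following is only a minimal sketch under some assumptions: the names `element_count_exceeds`, `poll_until` and `load_all` are hypothetical and not part of Selenium, and `poll_until` is a simplified stand-in for what `WebDriverWait(driver, timeout).until(...)` does, not its actual implementation. The locator is assumed to be a `(strategy, value)` tuple such as `("class name", "listing__name--link")`, which is the string form of `(By.CLASS_NAME, ...)`.

```python
import time


class element_count_exceeds:
    """Callable condition: truthy once more than `count` elements match `locator`."""

    def __init__(self, locator, count):
        self.locator = locator  # assumed form: (strategy, value)
        self.count = count

    def __call__(self, driver):
        found = driver.find_elements(*self.locator)
        # Return the elements themselves on success, False otherwise
        return found if len(found) > self.count else False


def poll_until(driver, condition, timeout, interval=0.5):
    """Simplified stand-in for an explicit wait: re-check until truthy or timed out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition(driver)
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(interval)


def load_all(driver, locator, timeout=3):
    """Scroll to the bottom repeatedly until the number of matches stops growing."""
    items = driver.find_elements(*locator)
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        try:
            items = poll_until(driver, element_count_exceeds(locator, len(items)), timeout)
        except TimeoutError:
            return items
```

With a real driver, an instance of `element_count_exceeds` could presumably be handed straight to `WebDriverWait(driver, 3).until(...)`, since `until` calls its argument with the driver; in that case the timeout would surface as Selenium's `TimeoutException` rather than the built-in `TimeoutError` used in this sketch.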

Answer 1: (score: 0)

To parse the names from the webpage, you can use the following code block:

  • Code block

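    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.support.ui import WebDriverWait

    SELECTOR = "h3[itemprop='name']>a.listing__name--link"

    options = Options()
    options.add_argument("start-maximized")
    options.add_argument("disable-infobars")
    options.add_argument("--disable-extensions")
    options.add_argument("--no-sandbox")
    driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\path\to\chromedriver.exe')
    driver.get('https://www.yellowpages.ca/search/si/1/coffee/all%20states')
    items = driver.find_elements_by_css_selector(SELECTOR)
    while True:
        # execute_script returns None for this call, so it cannot serve as the loop condition
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        try:
            # Wait until more listings are present than were collected so far
            WebDriverWait(driver, 3).until(
                lambda d: len(d.find_elements_by_css_selector(SELECTOR)) > len(items))
        except TimeoutException:
            break  # no new listings appeared within the timeout
        # Re-query instead of appending, so items stays a flat list of elements
        items = driver.find_elements_by_css_selector(SELECTOR)
    for item in items:
        print(item.text)
    driver.quit()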
    
  • Console output

    Tim Hortons
    Downtown Expresso Café
    Tim Hortons
    Tim Hortons
    Tim Hortons
    Starbucks
    Tim Hortons
    Tim Hortons
    Tim Hortons
    Tim Hortons
    Tim Hortons
    Tim Hortons
    Tim Hortons
    Starbucks
    Tim Hortons
    Tim Hortons
    Budokan
    Anchor Cafe House
    Starbucks
    Tim Hortons
    Tim Hortons
    Starbucks
    Tim Hortons
    Starbucks
    Tim Hortons
    Tim Hortons
    Colonial Coffee Co Ltd
    Personal Service Coffee
    Tim Hortons
    Suzie's Grill Cafe Inc
    Loaves N Fishes Catering & Cafe
    Tim Hortons
    Tim Hortons
    Tim Hortons
    Tim Hortons
    Elizabeth Houte Coiffure
    The Grind House Cafe
    Tim Hortons
    Black Bench Coffee Roasters
    Tim Hortons