Can't get rid of hardcoded delay even when Explicit Wait is already there

Time: 2017-12-17 15:49:08

Tags: python python-3.x selenium selenium-webdriver web-scraping

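I've written some code in Python in combination with Selenium to parse different questions from quora.com. My scraper is doing its job. The thing is, I've used a hardcoded delay here to make the scraper work, even though an Explicit Wait has already been defined. As the page is an infinite-scrolling one, I tried to limit the scrolling to a fixed number of steps. Now, I have two questions: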

  1. Why is wait.until(EC.staleness_of(page)) not working in my scraper? It is commented out now.
  2. If I use something else instead of page = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "question_link"))), the scraper throws an error: can't focus element.

By the way, I do not wish to go for the page = driver.find_element_by_tag_name('body') option.

This is what I've written so far:

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    driver = webdriver.Chrome()
    driver.get("https://www.quora.com/topic/C-programming-language")
    wait = WebDriverWait(driver, 10)
    
    page = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "question_link")))
    for scroll in range(10):
        page.send_keys(Keys.PAGE_DOWN)
        time.sleep(2)
        # wait.until(EC.staleness_of(page))
    
    for item in wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "rendered_qtext"))):
        print(item.text)
    
    driver.quit()
    

1 Answer:

Answer 0 (score: 1)

You can try the code below to trigger as many XHRs as possible and then parse the page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get("https://www.quora.com/topic/C-programming-language")
wait = WebDriverWait(driver, 10)

# Any visible question link serves as the element that receives key presses.
page = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "question_link")))
links_counter = len(wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "question_link"))))
while True:
    page.send_keys(Keys.END)
    try:
        # Wait until the XHR triggered by scrolling has added more links to the DOM.
        wait.until(lambda d: len(d.find_elements(By.CLASS_NAME, "question_link")) > links_counter)
        links_counter = len(driver.find_elements(By.CLASS_NAME, "question_link"))
    except TimeoutException:
        # No new links within 10 seconds: assume the end of the feed.
        break


for item in wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "rendered_qtext"))):
    print(item.text)

driver.quit()

Here we scroll the page down and wait up to 10 seconds for more links to load, or break out of the while loop if the number of links stays the same.
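The question also asked for a bounded number of scrolls. Below is a minimal sketch of the same wait-for-growth idea with a cap on the number of rounds, written against plain callables so it is not tied to any particular page. The helper name scroll_until_no_growth and all of its parameters are hypothetical, and the polling loop is a simplified stand-in for WebDriverWait, not Selenium's own implementation.

```python
import time


def scroll_until_no_growth(count_items, scroll_once, max_rounds=10,
                           timeout=10.0, poll=0.5):
    """Scroll up to max_rounds times, stopping early once a scroll
    produces no new items within `timeout` seconds.

    count_items: callable returning the current number of items.
    scroll_once: callable that triggers one scroll.
    Returns the final item count.
    """
    seen = count_items()
    for _ in range(max_rounds):
        scroll_once()
        deadline = time.monotonic() + timeout
        # Poll until the count grows or the deadline passes.
        while count_items() <= seen and time.monotonic() < deadline:
            time.sleep(poll)
        current = count_items()
        if current <= seen:
            break  # nothing new was loaded; assume the end of the feed
        seen = current
    return seen
```

With Selenium, count_items could be lambda: len(driver.find_elements(By.CLASS_NAME, "question_link")) and scroll_once could be lambda: page.send_keys(Keys.END), which keeps the fixed time.sleep(2) out of the loop while still limiting the scrolling to max_rounds.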

As for your questions:

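  1. wait.until(EC.staleness_of(page)) does not work because scrolling the page down does not give you a new DOM. It only triggers an XHR that adds more links to the existing DOM, so the first link (page) never becomes stale in this case.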

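  2. (I'm not quite sure about this, but...) I guess you can send keys only to nodes that can receive focus (ones the user could focus manually), e.g. links, input fields, textareas, buttons..., but not content divisions (div), paragraphs (p), etc.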
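If sending keys to a non-focusable element is indeed what triggers can't focus element, one way to sidestep it is to scroll with JavaScript instead of key presses, which needs no focused element and also avoids selecting body. A minimal sketch follows; the helper name scroll_to_bottom is hypothetical, and driver is assumed to be any Selenium WebDriver instance (only its execute_script method is used).

```python
def scroll_to_bottom(driver):
    """Scroll the window to the bottom of the document via JavaScript.

    Returns the document height reported by the browser, which can be
    compared across calls to see whether new content was appended.
    """
    return driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight);"
        "return document.body.scrollHeight;"
    )
```

In the loop above, a call to scroll_to_bottom(driver) would take the place of page.send_keys(Keys.END); the wait on the number of question_link elements stays the same.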