Unable to scrape items from two different depths using selenium

Time: 2020-04-16 18:33:40

Tags: python python-3.x selenium selenium-webdriver web-scraping

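I've created a script in python in combination with selenium to get the number of answers from the landing page and the name of the asker from the inner page. I know it is easier to scrape both items by using the question links and the next-page links, but that is not what I intend to do here. The bottom line is that I'm trying to traverse the different places using clicks only. However, when I run the script, it throws the following error in its second iteration, pointing at this line: answer = WebDriverWait(item,10).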

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document

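Although the elements I'm looking for are available on both the landing page and the inner page, it is a requirement that I scrape the two items from two different depths.

I know how to scrape them using requests, so I'm not willing to go that route either.

The script I'm trying with: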

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = 'https://stackoverflow.com/questions/tagged/web-scraping'

def get_content(link):
    driver.get(link)
    while True:
        for count,item in enumerate(WebDriverWait(driver,10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".question-summary")))):
            # the error is thrown on the following line in its second iteration
            answer = WebDriverWait(item,10).until(EC.presence_of_element_located((By.CSS_SELECTOR,"[class$='answered'] > strong"))).text

            elem = driver.find_elements_by_css_selector(".summary a.question-hyperlink")[count]
            driver.execute_script("arguments[0].click();",elem)
            name = WebDriverWait(driver,10).until(EC.presence_of_element_located((By.CSS_SELECTOR,"h1[itemprop='name'] > a"))).text
            print(answer,name)
            driver.back()

        try:
            next_page = WebDriverWait(driver,10).until(EC.presence_of_element_located((By.CSS_SELECTOR,"a[rel='next']")))
            driver.execute_script("arguments[0].click();",next_page)
        except Exception:
            break

if __name__ == '__main__':
    with webdriver.Chrome() as driver:
        get_content(link)

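How can I scrape the two items from two different depths?

P.S. If I kick out this line, answer = WebDriverWait(item,10)----, the script runs like a charm, traversing different depths and multiple pages.

1 Answer:

Answer 0: (score: 2)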

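Getting a StaleElementReferenceException is expected here: you navigate away from the page, so the references to the .question-summary elements collected before the click are lost. After driver.back() the page is rendered again, and the old references no longer point at anything in the current document.

Error description: Thrown when a reference to an element is now "stale".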
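To make the failure mode concrete, here is a small toy model that runs without a browser. It is not Selenium: ToyPage, ToyElement and StaleReference are hypothetical stand-ins invented for this sketch, and the "generation" counter is only an assumption used to imitate a page being rebuilt after navigation. It shows that a handle obtained before leaving the page fails afterwards, while re-locating the element by index succeeds.

```python
class StaleReference(Exception):
    """Stand-in for selenium's StaleElementReferenceException (toy model only)."""


class ToyElement:
    def __init__(self, page, text):
        self._page = page
        self._generation = page.generation  # DOM version this handle belongs to
        self._text = text

    @property
    def text(self):
        # A handle is only valid for the DOM it was created from.
        if self._generation != self._page.generation:
            raise StaleReference("element is not attached to the page document")
        return self._text


class ToyPage:
    def __init__(self, texts):
        self._texts = list(texts)
        self.generation = 0

    def find_elements(self):
        # Every lookup returns handles bound to the current DOM version.
        return [ToyElement(self, t) for t in self._texts]

    def navigate_away_and_back(self):
        # Same content, but a freshly built DOM: old handles no longer match.
        self.generation += 1


page = ToyPage(["3", "0", "1"])
items = page.find_elements()      # collected once, like the enumerate(...) loop
first = items[0].text             # works: handle and DOM agree
page.navigate_away_and_back()     # imitates the click followed by driver.back()

try:
    items[1].text                 # handle from the old DOM
    second_failed = False
except StaleReference:
    second_failed = True

fresh = page.find_elements()[1].text  # re-locating by index works again
```

This mirrors the question's script: the first iteration succeeds, and the second one fails because it reuses a list that was built before the navigation.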

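To do what you want, the code below will work. I changed the [class$='answered'] > strong selector to [class*='answered'] > strong, because otherwise you get an error whenever a question has an accepted answer. If you only want the questions without an accepted answer, modify the script as needed.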

def get_content(link):
    driver.get(link)
    while True:
        # Keep only the number of questions; element references would go stale after navigating.
        count = len(WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".question-summary"))))
        for ix in range(count):
            # Re-locate the question on every iteration so the reference belongs to the current page.
            question = driver.find_elements(By.CSS_SELECTOR, ".question-summary")[ix]
            answers_count = question.find_element(By.CSS_SELECTOR, "[class*='answered'] > strong").text

            # Click through to the inner page, read the name, then return to the listing.
            driver.execute_script("arguments[0].click();", question.find_element(By.CSS_SELECTOR, "a.question-hyperlink"))
            name = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "h1[itemprop='name'] > a"))).text
            print(answers_count, name)
            driver.back()
        try:
            # Move to the next listing page; stop when there is no "next" link.
            next_page = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "a[rel='next']")))
            driver.execute_script("arguments[0].click();", next_page)
        except Exception:
            break
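
On the selector change mentioned in this answer: in CSS, [class$='x'] matches when the whole class attribute ends with x, while [class*='x'] matches when it contains x anywhere. The sketch below imitates both rules with plain string checks. The three class strings are assumed examples of what the status box might carry, not values taken from the live page, and the helper names are made up for illustration.

```python
def matches_suffix(class_attr, value):
    """Imitates [class$='value']: the attribute string ends with value."""
    return class_attr.endswith(value)


def matches_substring(class_attr, value):
    """Imitates [class*='value']: the attribute string contains value."""
    return value in class_attr


# Assumed class attributes for the answer-count box (illustrative only).
samples = ["status unanswered", "status answered", "status answered-accepted"]

suffix_hits = [s for s in samples if matches_suffix(s, "answered")]
substring_hits = [s for s in samples if matches_substring(s, "answered")]

print(suffix_hits)
print(substring_hits)
```

Under these assumed class names, the suffix form skips the accepted-answer variant, which is why the lookup fails on such questions, while the substring form covers all three.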