Question

I am trying to learn how to scrape news headlines with Python by following this article I found: https://medium.com/analytics-vidhya/how-to-scrape-news-headlines-from-reuters-27c0274dc13c

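It worked fine, but when I try to replicate it with other news pages I keep getting a "no such element" error. I realize this is because I am selecting the wrong class in the HTML, but I don't understand which class I should select instead.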

The script from the article works on this news page: https://www.reuters.com/news/archive/technologynews?view=page&page=6&pageSize=10

I am trying to use it on the following pages, specifically to research a local state agency:

https://www.startribune.com/search/?page=1&q=%22Department%20of%20Human%20Services%22&refresh=true

https://www.twincities.com/?s=%22Department+of+Human+Services%22&orderby=date&order=desc

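Here is the code. The only changes are that I replaced the Reuters page with the first of the pages I am researching and replaced the class name used to select the button: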

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import dateutil.parser
import time
import csv
from datetime import datetime
import io
driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://www.startribune.com/search/?page=1&q=%22Department%20of%20Human%20Services%22&refresh=true')

count = 0
headlines =[]
dates = []
for x in range(500):    
    try:
        # loadMoreButton.click()
        # time.sleep(3)
        loadMoreButton = driver.find_element_by_class_name("pagination-shortcut-link")
        # driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(3)
        loadMoreButton.click()
        time.sleep(2)
        news_headlines = driver.find_elements_by_class_name("story-title")
        news_dates = driver.find_elements_by_class_name("timestamp")
        for headline in news_headlines:
            headlines.append(headline.text)
            print(headline.text)
        for date in news_dates:
            dates.append(date.text)
            print(date.text)
        count = count + 1
        print("CLICKED!!:")
    except Exception as e:
        print(e)
        break

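To get the class names, I right-click, choose Inspect Element, and copy what I see. But I keep getting the error, and I am not sure which other class I am supposed to use.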

Answer 1

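Class names change with every web page you visit, because each site's developers choose the names of their own elements. You should understand at least basic HTML before using Selenium. When you change the page you are scraping, you will often (if not always) have to change your code.

I also suggest not relying on IDs when using Selenium, because developers can change them whenever they like. Google's sites, for example, have an algorithm that changes the IDs, so your code may work now and then stop working later, even on the same page. It is better to check for static text inside the element, or something else that is unlikely to change in the short term. For example, if you need to click a button labeled "Next", collect all the buttons on the page, loop over them, find the one whose text is "Next", and call click() on it.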

See my answer here: How to click on the Ask to join button within https://meet.google.com using Selenium and Python?
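The text-matching approach described in this answer could be sketched roughly as follows. The helper names `first_with_text` and `click_by_text` are made up for illustration, and the default tag `a` is only a guess about how a given site marks up its "Next" control; check the actual page in the inspector before relying on it.

```python
def first_with_text(elements, wanted):
    """Return the first element whose visible text equals `wanted`.

    The comparison ignores case and surrounding whitespace.
    Returns None when nothing matches.
    """
    wanted = wanted.strip().lower()
    for element in elements:
        if element.text.strip().lower() == wanted:
            return element
    return None


def click_by_text(driver, wanted, tag="a"):
    """Click the first <tag> element on the current page labeled `wanted`.

    Returns True if something was clicked, False otherwise.
    """
    # Imported here so first_with_text() can be tried without Selenium installed.
    from selenium.webdriver.common.by import By

    candidates = driver.find_elements(By.TAG_NAME, tag)
    target = first_with_text(candidates, wanted)
    if target is None:
        return False
    target.click()
    return True
```

With a live driver this would be called as `click_by_text(driver, "Next")`. Matching on text tends to survive a class rename, but it breaks if the site changes the label or serves a translated page.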

Answer 2

To click the link to the next page, you need to induce WebDriverWait for element_to_be_clickable(), and you can use either of the following Locator Strategies:

Using CSS_SELECTOR:

driver.get('https://www.startribune.com/search/?page=1&q=%22Department%20of%20Human%20Services%22&refresh=true')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "li.pagination-list-item.is-selected +li > a"))).click()

Using XPATH:

driver.get('https://www.startribune.com/search/?page=1&q=%22Department%20of%20Human%20Services%22&refresh=true')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//li[@class='pagination-list-item is-selected']//following::li[1]/a"))).click()

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
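For context on why an explicit wait helps here: instead of sleeping for a fixed time and hoping the element exists, it polls a condition until the condition succeeds or a timeout expires. The following is only a conceptual sketch in plain Python, not Selenium's actual implementation. The name `wait_until` is hypothetical, and `LookupError` stands in for Selenium's `NoSuchElementException`; the 0.5 second default mirrors WebDriverWait's default poll frequency.

```python
import time


def wait_until(condition, timeout=20.0, poll=0.5, ignored=(LookupError,)):
    """Call `condition()` repeatedly until it returns something truthy.

    Exceptions listed in `ignored` are treated as "not ready yet" and
    retried. Raises TimeoutError if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except ignored:
            pass  # e.g. the element is not in the DOM yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)
```

In the snippets above, `WebDriverWait(driver, 20).until(...)` plays the role of `wait_until`, and `EC.element_to_be_clickable(...)` plays the role of `condition`. That is why the click no longer depends on guessing the right `time.sleep()` value.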

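Reference

You can find a relevant discussion on NoSuchElementException in:

Message: no such element: Unable to locate element: {"method":"css selector","selector":".pagination-shortcut-link"} when clicking the next page link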
