Web scraping with Selenium and chromedriver in Python

Time: 2020-03-21 23:17:26

Tags: python selenium web-scraping selenium-chromedriver

I am looking at this page. I am trying to scrape this data (marked in red) using Selenium and chromedriver:

[Image: screenshot of the page with the target value marked in red]

Here is my Python code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from time import sleep

chrome_options = Options()
chrome_options.add_argument("--disable-infobars")
chrome_options.add_argument("disable-infobars")
driver = webdriver.Chrome(executable_path="/ABC/chromedriver", chrome_options=chrome_options)

driver.get("https://finance.yahoo.com/quote/IBM")
sleep(10)
estimated = driver.find_element_by_class_name("IbBox Ta(start) C($tertiaryColor)")

However, the code does not get the Est. Return value. After a long wait, it fails with the following error message:

selenium.common.exceptions.InvalidSelectorException: Message: invalid selector: An invalid or illegal selector was specified

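What am I doing wrong? What is the best and fastest way to get the Est. Return value from the page?

Update: this is what I get when I use Inspect Element in Chrome: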

[Image: screenshot of the Chrome Inspect Element panel]

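3 answers:

Answer 0 (score: 1)

Headers play an important role in getting the value you are after, so make sure you send a User-Agent header. With that in place, this is how you can get the content you need.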

import requests
from bs4 import BeautifulSoup

link = "https://finance.yahoo.com/quote/IBM"

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'}

r = requests.get(link,headers=headers)
soup = BeautifulSoup(r.text,"lxml")
# Raw string, so the backslashes reach the CSS selector unchanged
est_return = soup.select_one(r"[class='Mb\(8px\)']").get_text()
print(est_return)

Answer 1 (score: 0)

Could you use XPath instead? It should look like this:

estimated = driver.find_element_by_xpath("//div[@class='IbBox Ta(start) C($tertiaryColor)']").text
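An exact `@class` comparison like the one above only matches when the attribute is exactly that string, in that order. As a more tolerant variant, here is a minimal sketch that builds an XPath testing each class token separately. The helper name `xpath_for_classes` is hypothetical and not part of Selenium, and it assumes the class names contain no single quotes.

```python
def xpath_for_classes(class_attr, tag="div"):
    """Build an XPath matching <tag> elements that carry every class in class_attr.

    Hypothetical helper for illustration; assumes no class name contains a single quote.
    """
    conditions = [
        # Pad the attribute with spaces so only whole class tokens match.
        "contains(concat(' ', normalize-space(@class), ' '), ' {} ')".format(name)
        for name in class_attr.split()
    ]
    return "//{}[{}]".format(tag, " and ".join(conditions))


xpath = xpath_for_classes("IbBox Ta(start) C($tertiaryColor)")
print(xpath)
```

The resulting string could be passed to `driver.find_element_by_xpath(xpath)`. Whether the page still uses these class names is not something this sketch can guarantee.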

Let me know how it goes! :D

Answer 2 (score: 0)

This error message...

selenium.common.exceptions.InvalidSelectorException: Message: invalid selector: An invalid or illegal selector was specified

...implies that the locator strategy you used is not a valid expression.
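To make this concrete: `find_element_by_class_name` expects a single class name, while the string passed in the question holds three space-separated classes containing characters such as `(`, `)` and `$`. A minimal sketch of turning such a class attribute into a CSS selector follows. The helper `css_from_classes` is hypothetical and not part of Selenium, and it does not handle class names that start with a digit.

```python
import re


def css_from_classes(class_attr, tag=""):
    """Build a CSS selector matching elements that carry every class in class_attr.

    Hypothetical helper for illustration; it is not part of Selenium.
    """
    parts = []
    for name in class_attr.split():
        # Backslash-escape every character that is not a letter, digit, hyphen or underscore.
        parts.append("." + re.sub(r"([^A-Za-z0-9_-])", r"\\\1", name))
    return tag + "".join(parts)


selector = css_from_classes("IbBox Ta(start) C($tertiaryColor)", tag="div")
print(selector)
```

With the Selenium version used in this thread, the resulting string could be passed to `driver.find_element_by_css_selector(selector)`.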


To scrape the text -6% Est. Return, you need to induce WebDriverWait for visibility_of_element_located(), and you can use the following locator strategy:

  • Using XPATH:

    driver.get('https://finance.yahoo.com/quote/IBM')
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Near Fair Value']//following::div[1]/div"))).text)
    
  • Console output:

    -6% Est. Return
    
  • Note: you have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
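If a number is needed rather than the raw label, the text returned above can be parsed afterwards. This is a minimal sketch, assuming the label keeps the form shown in the console output and uses an ASCII minus sign. The function name `parse_est_return` is hypothetical.

```python
import re


def parse_est_return(label):
    """Return the percentage in a label such as '-6% Est. Return' as a float, or None."""
    # Optional sign, digits, optional decimal part, then a percent sign.
    match = re.search(r"([+-]?\d+(?:\.\d+)?)\s*%", label)
    return float(match.group(1)) if match else None


print(parse_est_return("-6% Est. Return"))  # -6.0
```

Returning `None` for an unexpected label lets the caller decide how to handle a page layout change instead of raising inside the parser.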