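I want to scrape the title and date, but Bloomberg keeps blocking me, so I used a headless browser to fetch the items I need.
Here is my code using Selenium and Scrapy: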
import scrapy
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


class BloomergSpider(scrapy.Spider):
    name = 'bloomerg'
    allowed_domains = ['www.bloomberg.com']
    start_urls = ['https://www.bloomberg.com/news/articles/2019-05-30/tesla-dealt-another-blow-as-barclays-sees-it-as-niche-carmaker']

    def parse(self, response):
        # Load the article in a real browser instead of using the Scrapy response.
        driver = webdriver.Firefox()
        driver.get('https://www.bloomberg.com/news/articles/2019-05-30/tesla-dealt-another-blow-as-barclays-sees-it-as-niche-carmaker')
        # Wait up to 10 seconds for the first <h1> after the "markets" label.
        title = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH,
            "//div[text()='markets']//following::h1[1]"))).get_attribute("innerHTML")
        # The publication date is the <time> element in the block after the headline.
        date = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH,
            "//div[text()='markets']//following::h1[1]//following::div[@class='lede-text-v2__times']/time[@itemprop='datePublished']"))).get_attribute("innerHTML")
        driver.quit()
        print(title)
        print(date)
I get this error:
selenium.common.exceptions.SessionNotCreatedException: Message: Unable to find a matching set of capabilities
Please help. Thank you.
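Answer 0 (score: 0)
You need to add geckodriver to your system PATH environment variable; that is what is causing your error.
If you don't have it yet (you should), you can get the latest release from https://github.com/mozilla/geckodriver/releases
If you are on Windows, search for "Edit the system environment variables" and append the folder that contains the geckodriver executable to the PATH environment variable.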
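To check whether the PATH change took effect, a minimal sketch like the following can help. It uses only the Python standard library; the helper names `find_driver` and `add_to_path` are hypothetical and not part of Selenium or Scrapy. `add_to_path` only changes PATH for the current Python process, so it is a temporary workaround rather than a replacement for editing the system setting.

```python
import os
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


def add_to_path(directory):
    """Append a directory to PATH for the current process only (hypothetical helper)."""
    parts = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.append(directory)
        os.environ["PATH"] = os.pathsep.join(parts)


if __name__ == "__main__":
    location = find_driver()
    if location is None:
        print("geckodriver is not on PATH")
    else:
        print("geckodriver found at", location)
```

If `find_driver()` returns `None` in a freshly opened terminal, the PATH entry is probably missing or points at the file rather than at its folder.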