Finding "span" elements with a specific class using BeautifulSoup

Time: 2019-10-04 21:04:03

Tags: python jquery html class beautifulsoup

I am looking for all of these elements:

<span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">

I tried using:

page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
spans = soup.findAll('span', {"class": "BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk"})

print(spans)

"URL" and "headers" were declared earlier, but the call returns an empty list: "[]"

How should I modify my code?

1 answer:

Answer 0 (score: 0)

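This is a tricky one, so here is how I scraped it. The spans are rendered by JavaScript, so requests plus BeautifulSoup alone will not see them; you have to load the page with Selenium first and can then hand the resulting HTML to BeautifulSoup. I am using Firefox and geckodriver version 0.24.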

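You also have to give the page time to finish loading, which the code below does with an implicit wait, so that the HTML you get matches what you see in your browser. Note that implicitly_wait only sets how long Selenium keeps polling when it looks up elements; an explicit wait (WebDriverWait) on the target span is the more reliable way to make sure the content has rendered. To understand why the page source can differ, read this paragraph: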

  

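/**
 * Get the source of the last loaded page. If the page has been modified after loading (for
 * example, by Javascript) there is no guarantee that the returned text is that of the modified
 * page. Please consult the documentation of the particular driver being used to determine whether
 * the returned text reflects the current state of the page or the text last sent by the web
 * server. The page source returned is a representation of the underlying DOM: do not expect it to
 * be formatted or escaped in the same way as the response sent from the web server. Think of it
 * as an artist's impression.
 */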

From the Selenium source code
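
To make the difference concrete, here is a minimal sketch that needs neither a browser nor a network connection. Both HTML strings are invented for illustration and are not the real Skyscanner markup, and the class name "price" and the helper find_price_spans are hypothetical. The first string stands in for what the server sends before any JavaScript runs, the second for the DOM after rendering. The same find_all call finds nothing in the first and finds the spans in the second.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw server response: the app container is
# still empty because no JavaScript has run yet.
static_html = '<html><body><div id="app-root"></div></body></html>'

# Hypothetical stand-in for the DOM after the browser has run the page scripts.
rendered_html = (
    '<html><body><div id="app-root">'
    '<span class="price">€ 99</span>'
    '<span class="price">€ 115</span>'
    '</div></body></html>'
)


def find_price_spans(html):
    """Return every <span class="price"> found in the given HTML string."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find_all('span', {'class': 'price'})


static_spans = find_price_spans(static_html)
rendered_spans = find_price_spans(rendered_html)

print(static_spans)                            # []
print([s.get_text() for s in rendered_spans])  # ['€ 99', '€ 115']
```

This is the same situation as in the question: requests only ever receives the first kind of document, so the search has nothing to match.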

Code

import os
import requests
from bs4 import BeautifulSoup
import lxml
from selenium import webdriver

url = 'https://www.skyscanner.it/trasporti/voli/berl/amst/191231/200102/?adultsv2=1&childrenv2=&cabinclass=economy&rtn=1&preferdirects=true&outboundaltsenabled=false&inboundaltsenabled=false&qp_prevProvider=ins_browse&qp_prevCurrency=EUR&priceSourceId=taps-taps&qp_prevPrice=116#/'
driver = webdriver.Firefox(executable_path=r'(put your path here)\geckodriver-v0.24.0-win64\geckodriver.exe')
driver.get(url)
#there will be differences in div id='app-root'
#in page_selenium.txt with and without implicit wait
driver.implicitly_wait(10)

#with selenium
html_selenium = driver.page_source
bs_selenium = BeautifulSoup(html_selenium, 'lxml')
with open('page_selenium.txt', 'w', encoding='utf-8') as outfile:
    outfile.write(bs_selenium.prettify())

#with requests
html_req = requests.get(url)
bs_req = BeautifulSoup(html_req.text,'lxml')
with open('page_bs.txt', 'w', encoding='utf-8') as outfile:
    outfile.write(bs_req.prettify())

#open and compare div id='app-root' in page_selenium.txt and page_bs.txt and you will understand why your method didn't work

#now scrape using the bs from selenium
spanner = bs_selenium.find_all('span',{'class':'BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk'})

print(spanner)

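#terminate the browser
#tskill is a Windows-only command; it ends any leftover Firefox plugin-container processes
os.system('tskill plugin-container')
#quit() closes every window and ends the geckodriver session, so a separate close() is not needed
driver.quit()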

Output

[<span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 99</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 99</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 172</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 99</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 99</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 99</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 115</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 115</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 115</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 136</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 136</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 136</span>]
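
If only the numeric prices are needed, the spans can be post-processed. The following is a rough sketch under stated assumptions, not part of the original answer. It matches the three classes with a CSS selector through select, which, unlike the exact-string match passed to find_all, does not depend on the order of the classes in the attribute. It also assumes every price looks like "€ 99" (a currency sign followed by a whole number with no thousands separator). The helper name extract_prices and the sample HTML are hypothetical.

```python
import re

from bs4 import BeautifulSoup

# CSS selector requiring all three classes; the order in which the classes
# appear in the attribute does not matter.
PRICE_SELECTOR = (
    'span.BpkText_bpk-text__2NHsO'
    '.BpkText_bpk-text--lg__3vAKN'
    '.BpkText_bpk-text--bold__4yauk'
)


def extract_prices(html):
    """Return the prices found in the matching spans as a list of ints."""
    soup = BeautifulSoup(html, 'html.parser')
    prices = []
    for span in soup.select(PRICE_SELECTOR):
        # Assumes text such as "€ 99"; spans without digits are skipped.
        match = re.search(r'\d+', span.get_text())
        if match:
            prices.append(int(match.group()))
    return prices


# Sample markup modelled on the output above. The second span has its classes
# in a different order, and the third carries only one of the three classes.
sample_html = (
    '<span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN '
    'BpkText_bpk-text--bold__4yauk">€ 99</span>'
    '<span class="BpkText_bpk-text--bold__4yauk BpkText_bpk-text__2NHsO '
    'BpkText_bpk-text--lg__3vAKN">€ 172</span>'
    '<span class="BpkText_bpk-text__2NHsO">not a price</span>'
)

prices = extract_prices(sample_html)
print(prices)  # [99, 172]
```

With the Selenium code above, the same helper could be called as extract_prices(driver.page_source) before the browser is closed. The generated class names (for example the "__2NHsO" suffix) may change whenever the site is rebuilt, so the selector would need updating in that case.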