I am looking for all elements like this one:
<span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">
I tried using:
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
spans = soup.findAll('span', {"class": "BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk"})
print(spans)
"URL" and "headers" were declared earlier, but this returns an empty list: []
How should I change my code?
Answer 0 (score: 0)
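This is a tricky one; here is how I scraped it. The spans are rendered by JavaScript, so requests on its own never sees them: you have to load the page with Selenium and then hand the rendered source to BeautifulSoup. I am using Firefox and geckodriver version 0.24.
You also have to wait for the page to finish rendering, so that the source you get matches what you see in the browser. The code below sets an implicit wait for this; keep in mind that an implicit wait only takes effect during element lookups. To see why the waiting matters, read this paragraph from the Selenium documentation for getPageSource: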
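/**
 * Get the source of the last loaded page. If the page has been modified after loading (for
 * example, by Javascript) there is no guarantee that the returned text is that of the modified
 * page. Please consult the documentation of the particular driver being used to determine whether
 * the returned text reflects the current state of the page or the text last sent by the web
 * server. The page source returned is a representation of the underlying DOM: do not expect it to
 * be formatted or escaped in the same way as the response sent from the web server. Think of it
 * as an artist's impression.
 */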
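To make the waiting step concrete, here is a minimal sketch of polling the page source until the markup you want shows up. `wait_for_markup` is a hypothetical helper written for this answer, not part of Selenium; it only assumes you can pass it a callable that returns the current HTML, such as `lambda: driver.page_source`. Selenium also ships its own explicit-wait mechanism (`WebDriverWait`), which is probably the better choice in real code.

```python
import time


def wait_for_markup(get_source, needle, timeout=10.0, interval=0.5):
    """Poll get_source() until `needle` appears in it or `timeout` seconds pass.

    get_source: zero-argument callable returning the current HTML as a string
                (assumption: e.g. `lambda: driver.page_source`).
    needle:     substring to look for, e.g. one of the span's class names.
    Returns the HTML that contained the needle, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_source()
        if needle in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{needle!r} not found after {timeout} s")
        time.sleep(interval)
```

With a live driver this could be called as `html = wait_for_markup(lambda: driver.page_source, 'BpkText_bpk-text--bold__4yauk')` and the result passed to BeautifulSoup.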
Code
import os
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
#the lxml package must be installed for the 'lxml' parser, but it does not need to be imported
url = 'https://www.skyscanner.it/trasporti/voli/berl/amst/191231/200102/?adultsv2=1&childrenv2=&cabinclass=economy&rtn=1&preferdirects=true&outboundaltsenabled=false&inboundaltsenabled=false&qp_prevProvider=ins_browse&qp_prevCurrency=EUR&priceSourceId=taps-taps&qp_prevPrice=116#/'
driver = webdriver.Firefox(executable_path=r'(put your path here)\geckodriver-v0.24.0-win64\geckodriver.exe')
#the implicit wait only applies to element lookups, so set it before looking anything up
driver.implicitly_wait(10)
driver.get(url)
#look up one of the target spans: this polls for up to 10 seconds until JavaScript
#has rendered it inside div id='app-root' (raises NoSuchElementException otherwise)
driver.find_element(By.CSS_SELECTOR, 'span.BpkText_bpk-text--bold__4yauk')
#with selenium
html_selenium = driver.page_source
bs_selenium = BeautifulSoup(html_selenium, 'lxml')
with open('page_selenium.txt', 'w', encoding='utf-8') as outfile:
    outfile.write(bs_selenium.prettify())
#with requests
html_req = requests.get(url)
bs_req = BeautifulSoup(html_req.text, 'lxml')
with open('page_bs.txt', 'w', encoding='utf-8') as outfile:
    outfile.write(bs_req.prettify())
#open and compare div id='app-root' in page_selenium.txt and page_bs.txt and you will understand why your method didn't work
#now scrape using the bs from selenium
spanner = bs_selenium.find_all('span', {'class': 'BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk'})
print(spanner)
#terminate the browser (tskill is Windows-only; it cleans up leftover Firefox plugin processes)
os.system('tskill plugin-container')
#quit() closes every window and ends the session, so a separate close() is not needed
driver.quit()
Output
[<span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 99</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 99</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 172</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 99</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 99</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 99</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 115</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 115</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 115</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 136</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 136</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 136</span>]
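If you want numbers rather than span tags, a small follow-up step can turn the text of each span into an integer. This is only a sketch: `parse_prices` is a hypothetical helper, and it assumes whole-euro amounts like the ones in the output above (it strips every non-digit character, so a decimal price such as "€ 99,50" would come out wrong).

```python
import re


def parse_prices(texts):
    """Turn strings such as '€ 99' into integers, skipping strings without digits."""
    prices = []
    for text in texts:
        # Drop the currency symbol, spaces and thousands separators.
        digits = re.sub(r"\D", "", text)
        if digits:
            prices.append(int(digits))
    return prices


# Sample strings taken from the output above.
sample = ["€ 99", "€ 99", "€ 172", "€ 115", "€ 136"]
prices = parse_prices(sample)
print(sorted(set(prices)))
print(min(prices))
```

With the real result list, the input would be something like `parse_prices(span.get_text() for span in spanner)`.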