When I run the following code...
import requests
from bs4 import BeautifulSoup
counter = []
url = 'https://www.somemuseum.org/exhibitions/current-exhibitions'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'html.parser')
links = soup.find_all(href="{{ card.url }}")
counter.append(links)
print(counter)
it returns...
<a class="card card--exhibit {{ card.type }}" href="{{ card.url }}">
Inspecting the same element on the live site shows it stored as...
<a href="/exhibitions/listings/2018/current-listing" class="card card--exhibit is-tier1">
What I'd like to do is write a for loop, something like the following...
for link in links:
    if card.type == "is-tier1":
        exhibit = soup.get('card.url')
        counter.append(exhibit)
I'm new to Beautiful Soup, so any help is appreciated. Thanks.
Answer 0 (score: 3)
Unfortunately, you can't get the href data with BeautifulSoup, because those values are rendered by js. However, you have a couple of options.
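You can confirm this by parsing a snippet of the raw (pre-js) HTML yourself: the href in the served markup is still the literal template placeholder, so there is nothing useful for BeautifulSoup to extract. A minimal demonstration, using the anchor tag shown above:

```python
from bs4 import BeautifulSoup

# The HTML that requests receives still contains the unrendered template,
# so the href is the literal placeholder, not a real path.
raw_html = '<a class="card card--exhibit {{ card.type }}" href="{{ card.url }}">Show</a>'
soup = BeautifulSoup(raw_html, 'html.parser')
link = soup.find('a')
print(link['href'])  # {{ card.url }}
```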
The first option is Selenium.
Selenium executes the page's js and has methods for selecting html elements, but it is quite slow and heavyweight.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = "https://www.metmuseum.org/exhibitions/current-exhibitions"
driver = webdriver.Firefox()
driver.get(url)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'a.card')))
elements = driver.find_elements(By.CSS_SELECTOR, 'a.is-tier1')
links = [e.get_attribute("href") for e in elements]
driver.quit()
The second option is to use the api.
The data is loaded by an xhr request to /api/Exhibitions/CurrentExhibitionsListing, so you can request it directly from the api and get the results as json.
import requests
url = 'https://www.metmuseum.org/api/Exhibitions/CurrentExhibitionsListing?location=main|breuer|cloisters&page=1'
req = requests.get(url)
results = req.json()['results']
links = [
'https://www.metmuseum.org' + i['url']
for i in results if i['type'] == 'is-tier1'
]
Both methods produce the same results, but I would use the second one, since it is much faster.
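The filtering step in the second option is easy to exercise offline against sample data. The dicts below are hypothetical stand-ins mirroring only the `url` and `type` fields the list comprehension relies on, not the api's full response:

```python
# Hypothetical sample mimicking the shape of the api's 'results' list.
results = [
    {'url': '/exhibitions/listings/2018/current-listing', 'type': 'is-tier1'},
    {'url': '/exhibitions/listings/2018/other-listing', 'type': 'is-tier2'},
]

# Same filter as above: keep only tier-1 exhibits and build absolute URLs.
links = [
    'https://www.metmuseum.org' + i['url']
    for i in results if i['type'] == 'is-tier1'
]
print(links)  # only the is-tier1 listing survives
```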