Beautiful Soup: scraping multiple URLs with Python 3

Asked: 2018-04-03 02:56:01

Tags: python-3.x beautifulsoup python-requests

When I run the following code...

import requests
from bs4 import BeautifulSoup

counter = []
url = 'https://www.somemuseum.org/exhibitions/current-exhibitions'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'html.parser')
links = soup.find_all(href="{{ card.url }}")
counter.append(links)
print(counter)

...it returns...

<a class="card card--exhibit {{ card.type }}" href="{{ card.url }}">

Inspecting the same element on the site, however, shows it stored as...

<a href="/exhibitions/listings/2018/current-listing" class="card card--exhibit is-tier1">

What I would like to do is a for loop similar to the following...

for link in links:
    if card.type=="is-tier1":
        exhibit = soup.get('card.url')
        counter.append(exhibit)

I'm new to Beautiful Soup, so any help is appreciated. Thanks.

1 Answer:

Answer 0 (score: 3)

Unfortunately, you can't get the href data with BeautifulSoup here, because it is rendered by JavaScript. You do, however, have a couple of options.
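As a quick sanity check (a sketch, not part of the original answer): if the HTML returned by requests still contains {{ ... }} placeholders, the page is a client-side template and the real data is filled in by JavaScript after load.

```python
# Minimal sketch: detect unrendered template placeholders in fetched HTML.
# The sample string mirrors the markup shown in the question above.
html = '<a class="card card--exhibit {{ card.type }}" href="{{ card.url }}">'
is_unrendered = '{{' in html and '}}' in html
print(is_unrendered)  # True
```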

The first option is Selenium. Selenium runs the JavaScript and has methods for selecting HTML elements, but it is slow and heavyweight.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


url = "https://www.metmuseum.org/exhibitions/current-exhibitions"
driver = webdriver.Firefox()
driver.get(url)
# Wait until the JS-rendered cards are present before querying them.
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'a.card')))
# find_elements(By.CSS_SELECTOR, ...) replaces the find_elements_by_css_selector
# helper, which was removed in Selenium 4.
elements = driver.find_elements(By.CSS_SELECTOR, 'a.is-tier1')
links = [e.get_attribute("href") for e in elements]
driver.quit()

The second option is to use the API.
The data is loaded by an XHR request to /api/Exhibitions/CurrentExhibitionsListing. You can request the data directly from the API and get the result as JSON.

import requests

url = 'https://www.metmuseum.org/api/Exhibitions/CurrentExhibitionsListing?location=main|breuer|cloisters&page=1'
req = requests.get(url)
results = req.json()['results']
# Keep only the tier-1 exhibits and build absolute URLs.
links = [
    'https://www.metmuseum.org' + i['url']
    for i in results if i['type'] == 'is-tier1'
]
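To illustrate the filtering step without hitting the live API, here is the same list comprehension run on hypothetical sample data (the 'url' and 'type' field names are taken from the code above; the values are made up for illustration):

```python
# Hypothetical sample mimicking the shape of the API's 'results' list.
results = [
    {'url': '/exhibitions/listings/2018/current-listing', 'type': 'is-tier1'},
    {'url': '/exhibitions/listings/2018/other-listing', 'type': 'is-tier2'},
]
# Same filter as in the answer: keep tier-1 entries and prefix the domain.
links = [
    'https://www.metmuseum.org' + i['url']
    for i in results if i['type'] == 'is-tier1'
]
print(links)  # ['https://www.metmuseum.org/exhibitions/listings/2018/current-listing']
```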

Both approaches produce the same result, but I would use the second one, since it is much faster.
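One small note on the string concatenation above: the standard library's urljoin is a slightly safer way to build absolute URLs from the relative paths the API returns (a minor suggestion, not part of the original answer):

```python
from urllib.parse import urljoin

base = 'https://www.metmuseum.org'
# Relative path as returned in each result's 'url' field.
absolute = urljoin(base, '/exhibitions/listings/2018/current-listing')
print(absolute)  # https://www.metmuseum.org/exhibitions/listings/2018/current-listing
```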