I am practicing web scraping with the requests and BeautifulSoup modules on the following page:
https://www.imdb.com/title/tt0080684/
So far, my code correctly prints the JSON in question. I need help extracting only name and description from that JSON into a response dictionary.
Code
# Send HTTP requests
import requests
import json
from bs4 import BeautifulSoup


class WebScraper:
    def send_http_request():
        # Obtain the URL via user input
        url = input('Input the URL:\n')
        # Get the webpage
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # Check response object's status code
        if r:
            p = json.loads("".join(soup.find('script', {'type':'application/ld+json'}).contents))
            print(p)
        else:
            print('\nInvalid movie page!')


WebScraper.send_http_request()
Expected output
{"title": "Star Wars: Episode V - The Empire Strikes Back", "description": "After the Rebels are brutally overpowered by the Empire on the ice planet Hoth, Luke Skywalker begins Jedi training with Yoda, while his friends are pursued by Darth Vader and a bounty hunter named Boba Fett all over the galaxy."}
Answer 0 (score: 1)
You just need to create a new dictionary from p using the two keys name and description.
# Check response object's status code
if r:
    p = json.loads("".join(soup.find('script', {'type':'application/ld+json'}).contents))
    desired_output = {"title": p["name"], "description": p["description"]}
    print(desired_output)
else:
    print('\nInvalid movie page!')
Output:
{'title': 'Star Wars: Episode V - The Empire Strikes Back', 'description': 'Star Wars: Episode V - The Empire Strikes Back is a movie starring Mark Hamill, Harrison Ford, and Carrie Fisher. After the Rebels are brutally overpowered by the Empire on the ice planet Hoth, Luke Skywalker begins Jedi training...'}
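The dictionary-building step can also be tried in isolation, without a network request. The following is a minimal sketch, not part of the answer's code: pick_fields is a hypothetical helper name, and the sample JSON string is made up for illustration rather than taken from the IMDb page. It uses dict.get so that a missing key yields None instead of raising KeyError.

```python
import json


def pick_fields(ld_json_text, mapping):
    """Return a new dict holding only the requested keys from a JSON string.

    mapping maps each output key to the source key it is read from.
    """
    data = json.loads(ld_json_text)
    # .get returns None when the source key is absent, instead of raising KeyError
    return {out_key: data.get(src_key) for out_key, src_key in mapping.items()}


# Made-up sample standing in for the contents of the ld+json script tag
sample = '{"@type": "Movie", "name": "Example Movie", "description": "An example plot.", "url": "/title/tt0000000/"}'

result = pick_fields(sample, {"title": "name", "description": "description"})
print(result)
```

In the scraper, the string passed in would be the joined contents of the script tag, in place of sample.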
Answer 1 (score: 1)
You can parse the dictionary and then use the dumps method to print a new JSON object:
# Send HTTP requests
import requests
import json
from bs4 import BeautifulSoup


class WebScraper:
    def send_http_request():
        # Obtain the URL via user input
        url = input('Input the URL:\n')
        # Get the webpage
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # Check response object's status code
        if r:
            p = json.loads("".join(soup.find('script', {'type':'application/ld+json'}).contents))
            output = json.dumps({"title": p["name"], "description": p["description"]})
            print(output)
        else:
            print('\nInvalid movie page!')


WebScraper.send_http_request()
Output:
{"title": "Star Wars: Episode V - The Empire Strikes Back", "description": "Star Wars: Episode V - The Empire Strikes Back is a movie starring Mark Hamill, Harrison Ford, and Carrie Fisher. After the Rebels are brutally overpowered by the Empire on the ice planet Hoth, Luke Skywalker begins Jedi training..."}
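The two answers print slightly different text: printing a dict shows Python's repr with single quotes, while json.dumps returns a str containing JSON with double quotes, which matches the format of the expected output in the question. A small sketch of the difference, using a made-up record rather than the scraped data:

```python
import json

# Made-up record standing in for the scraped title and description
record = {"title": "Example Movie", "description": "An example plot."}

as_text = json.dumps(record)  # serialize the dict to a JSON string

print(record)   # Python dict repr, single-quoted strings
print(as_text)  # JSON text, double-quoted strings

# The JSON string parses back into an equal dict
assert json.loads(as_text) == record
```

If the result is only used inside Python, the plain dict is usually enough; json.dumps is the better fit when the output has to be valid JSON.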