Question

I am practicing web scraping with the requests and BeautifulSoup modules on the following website:

https://www.imdb.com/title/tt0080684/

So far, my code correctly prints the JSON in question. I need help extracting only name and description from that JSON into a response dictionary.

Code

# Send HTTP requests
import requests

import json

from bs4 import BeautifulSoup


class WebScraper:

    def send_http_request():

        # Obtain the URL via user input
        url = input('Input the URL:\n')

        # Get the webpage
        r = requests.get(url)

        soup = BeautifulSoup(r.content, 'html.parser')

        # Check response object's status code
        if r:
            p = json.loads("".join(soup.find('script', {'type':'application/ld+json'}).contents))
            print(p)
        else:
            print('\nInvalid movie page!')


WebScraper.send_http_request()

Expected output

{"title": "Star Wars: Episode V - The Empire Strikes Back", "description": "After the Rebels are brutally overpowered by the Empire on the ice planet Hoth, Luke Skywalker begins Jedi training with Yoda, while his friends are pursued by Darth Vader and a bounty hunter named Boba Fett all over the galaxy."}

Answer 1

You just need to build a new dictionary from p using its two keys, name and description.

        # Check response object's status code
        if r:
            p = json.loads("".join(soup.find('script', {'type':'application/ld+json'}).contents))
            desired_output = {"title": p["name"], "description": p["description"]}
            print(desired_output)
        else:
            print('\nInvalid movie page!')

Output:

{'title': 'Star Wars: Episode V - The Empire Strikes Back', 'description': 'Star Wars: Episode V - The Empire Strikes Back is a movie starring Mark Hamill, Harrison Ford, and Carrie Fisher. After the Rebels are brutally overpowered by the Empire on the ice planet Hoth, Luke Skywalker begins Jedi training...'}
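
Indexing with p["name"] raises KeyError if a page's JSON-LD happens to lack that key. The following is a minimal sketch of a more tolerant variant, assuming p is the dictionary already returned by json.loads. The helper name pick_fields and the sample JSON are hypothetical and are not taken from IMDb.

```python
import json

# Hypothetical sample standing in for the text of the page's
# <script type="application/ld+json"> tag (not real IMDb data).
sample_json = (
    '{"@type": "Movie", "name": "Example Movie", '
    '"description": "An example plot.", "url": "/title/tt0000000/"}'
)


def pick_fields(p, mapping):
    """Build a new dict whose keys are renamed per `mapping`.

    `mapping` maps output key -> source key. Source keys that are
    missing from `p` yield None instead of raising KeyError.
    """
    return {new_key: p.get(old_key) for new_key, old_key in mapping.items()}


p = json.loads(sample_json)
result = pick_fields(p, {"title": "name", "description": "description"})
print(result)
```

Whether a missing key should become None or abort the scrape depends on the use case; dict.get is used here only so that a single incomplete page does not crash the script.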

Answer 2

You can parse the dictionary and then use the json.dumps method to print a new JSON object:

# Send HTTP requests
import requests

import json

from bs4 import BeautifulSoup


class WebScraper:

    def send_http_request():

        # Obtain the URL via user input
        url = input('Input the URL:\n')

        # Get the webpage
        r = requests.get(url)

        soup = BeautifulSoup(r.content, 'html.parser')

        # Check response object's status code
        if r:
            p = json.loads("".join(soup.find('script', {'type':'application/ld+json'}).contents))
            output = json.dumps({"title": p["name"], "description": p["description"]})
            print(output)
        else:
            print('\nInvalid movie page!')


WebScraper.send_http_request()

Output:

{"title": "Star Wars: Episode V - The Empire Strikes Back", "description": "Star Wars: Episode V - The Empire Strikes Back is a movie starring Mark Hamill, Harrison Ford, and Carrie Fisher. After the Rebels are brutally overpowered by the Empire on the ice planet Hoth, Luke Skywalker begins Jedi training..."}
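
The two answers differ only in how the result is printed: printing a dict shows Python's repr with single quotes, which is not valid JSON, while json.dumps produces a double-quoted JSON string like the one in the expected output. A small sketch of the difference, using a hypothetical record rather than scraped data:

```python
import json

# Hypothetical record; the accented character shows the effect of ensure_ascii.
record = {"title": "Example Movie", "description": "Caf\u00e9 scene"}

# Python's own representation: single quotes, not parseable as JSON.
as_repr = str(record)

# Valid JSON text; non-ASCII characters are escaped by default.
as_json = json.dumps(record)

# Same data, keeping non-ASCII characters as-is and pretty-printing.
as_json_readable = json.dumps(record, ensure_ascii=False, indent=2)

print(as_repr)
print(as_json)
print(as_json_readable)
```

If the output is meant to be consumed by another program, the json.dumps form is usually the safer choice, since it round-trips through json.loads.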
