Parsing JSON web scraper output

Time: 2021-03-04 20:16:02

Tags: python beautifulsoup

I am practicing web scraping with the requests and BeautifulSoup modules on the following website:

https://www.imdb.com/title/tt0080684/

So far, my code correctly prints the JSON in question. I need help extracting only name and description from that JSON into a response dictionary.

Code

# Send HTTP requests
import requests

import json

from bs4 import BeautifulSoup


class WebScraper:

    @staticmethod
    def send_http_request():

        # Obtain the URL via user input
        url = input('Input the URL:\n')

        # Get the webpage
        r = requests.get(url)

        soup = BeautifulSoup(r.content, 'html.parser')

        # Check response object's status code
        if r:
            p = json.loads("".join(soup.find('script', {'type':'application/ld+json'}).contents))
            print(p)
        else:
            print('\nInvalid movie page!')


WebScraper.send_http_request()

Expected output

{"title": "Star Wars: Episode V - The Empire Strikes Back", "description": "After the Rebels are brutally overpowered by the Empire on the ice planet Hoth, Luke Skywalker begins Jedi training with Yoda, while his friends are pursued by Darth Vader and a bounty hunter named Boba Fett all over the galaxy."}

2 answers:

Answer 0 (score: 1)

You just need to build a new dictionary from p using its two keys, name and description.

        # Check response object's status code
        if r:
            p = json.loads("".join(soup.find('script', {'type':'application/ld+json'}).contents))
            desired_output = {"title": p["name"], "description": p["description"]}
            print(desired_output)
        else:
            print('\nInvalid movie page!')

Output:

{'title': 'Star Wars: Episode V - The Empire Strikes Back', 'description': 'Star Wars: Episode V - The Empire Strikes Back is a movie starring Mark Hamill, Harrison Ford, and Carrie Fisher. After the Rebels are brutally overpowered by the Empire on the ice planet Hoth, Luke Skywalker begins Jedi training...'}
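
Indexing with p["name"] raises KeyError if a page's JSON-LD lacks that key. The same dictionary construction can be made tolerant of missing keys; below is a minimal sketch, where the helper name pick_fields and the sample data are hypothetical and not part of the original code.

```python
def pick_fields(data, mapping, default=None):
    """Build a new dict from `data`, renaming keys according to `mapping`.

    `mapping` maps source keys to output keys; a key missing from `data`
    yields `default` instead of raising KeyError.
    """
    return {out_key: data.get(src_key, default)
            for src_key, out_key in mapping.items()}


# Hypothetical stand-in for the parsed JSON-LD dictionary `p`.
p = {
    "@type": "Movie",
    "name": "Star Wars: Episode V - The Empire Strikes Back",
    "description": "After the Rebels are brutally overpowered by the Empire...",
}

desired_output = pick_fields(p, {"name": "title", "description": "description"})
print(desired_output)
```

Whether a missing key should become None or stop the program is a design choice; for a one-off practice script the direct p["name"] lookup above is arguably fine.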

Answer 1 (score: 1)

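You can parse the JSON into a dictionary, then use the json.dumps method to serialize the two fields into a new JSON string and print it: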

# Send HTTP requests
import requests

import json

from bs4 import BeautifulSoup


class WebScraper:

    @staticmethod
    def send_http_request():

        # Obtain the URL via user input
        url = input('Input the URL:\n')

        # Get the webpage
        r = requests.get(url)

        soup = BeautifulSoup(r.content, 'html.parser')

        # Check response object's status code
        if r:
            p = json.loads("".join(soup.find('script', {'type':'application/ld+json'}).contents))
            output = json.dumps({"title": p["name"], "description": p["description"]})
            print(output)
        else:
            print('\nInvalid movie page!')


WebScraper.send_http_request()

Output:

{"title": "Star Wars: Episode V - The Empire Strikes Back", "description": "Star Wars: Episode V - The Empire Strikes Back is a movie starring Mark Hamill, Harrison Ford, and Carrie Fisher. After the Rebels are brutally overpowered by the Empire on the ice planet Hoth, Luke Skywalker begins Jedi training..."}
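
The two answers differ only in what gets printed: printing a dict shows Python's repr with single quotes, while json.dumps returns JSON text with double quotes. A small sketch of that difference, using a hypothetical sample dict rather than the live page data:

```python
import json

# Hypothetical sample standing in for the extracted fields.
record = {
    "title": "Star Wars: Episode V - The Empire Strikes Back",
    "description": "A movie.",
}

as_repr = str(record)         # Python dict repr: single-quoted, not valid JSON
as_json = json.dumps(record)  # JSON text: double-quoted strings

print(as_repr)
print(as_json)

# Only the json.dumps output round-trips through json.loads.
round_tripped = json.loads(as_json)

try:
    json.loads(as_repr)
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False
```

If the output is meant for another program to consume, the json.dumps form is the safer choice; for a quick look in the terminal either works.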