How do I convert a bs4.element.Tag to a JSON dictionary?

Time: 2018-08-31 00:55:36

Tags: python html json web-scraping beautifulsoup

I am using Beautiful Soup 4 to scrape the HTML page of a recipe, and the application/ld+json script contains the following:

['\r\n{\r\n  "@context": "https://schema.org/",\r\n  "@type": "Recipe",\r\n  "name": "The College Boy",\r\n  "url": "https://www.bodybuilding.com/recipes/the-college-boy",\r\n  "author": {\r\n    "@type": "Person",\r\n    "name": "Matt Biss"\r\n  },\r\n  "image": [\r\n    "https://www.bodybuilding.com/images/2018/august/crockpot-4b-header-960x540.jpg",\r\n            "https://www.bodybuilding.com/images/2018/august/crockpot-4b-square-600x600.jpg"\r\n      ],\r\n  "datePublished": "2018-08-27 00:00:00.0",\r\n  "publisher": {\r\n    "@type": "Organization",\r\n    "name": "Bodybuilding.com",\r\n    "logo": {\r\n      "@type": "ImageObject",\r\n      "url": "https://www.bodybuilding.com/images/icons/bb-logo-clean.png",\r\n      "width": 666,\r\n      "height": 422\r\n    }\r\n  },\r\n  "description": "I call this the "College Boy" because of its simple preparation. No chopping, dicing, slicing, or any real work is needed. You need only be able to use a can opener and get the top off the jar, and several hours later you will end up with some high-quality belly stuffing.",\r\n  "prepTime": "PT10M",\r\n  "cookTime": "PT420M",\r\n  "totalTime": "PT430M",\r\n  "recipeYield": "4 servings",\r\n  "recipeCuisine": "American",\r\n  "keywords": "Crockpot",\r\n  "nutrition": {\r\n    "@type": "NutritionInformation",\r\n            "calories": "607 calories",\r\n                "carbohydrateContent": "23 g",\r\n                "proteinContent": "70 g",\r\n                "fatContent": "26 g",\r\n        "servingSize": "4 servings"\r\n  },\r\n  "recipeIngredient": [\r\n                        "4 piece chicken breast",                    "1 16 oz can black beans, drained and rinsed",                    "1 15 oz can corn",                    "8 oz cream cheese"              ],\r\n  "recipeInstructions": [\r\n          {\r\n        "@type": "HowToStep",\r\n        "text": "Place chicken breasts in the Crock-Pot. 
They can still be frozen if that is your style."\r\n      },          {\r\n        "@type": "HowToStep",\r\n        "text": "Drain cans of black beans and corn and add them into the cauldron."\r\n      },          {\r\n        "@type": "HowToStep",\r\n        "text": "Top it with your salsa, stir it up, and let it go!"\r\n      },          {\r\n        "@type": "HowToStep",\r\n        "text": "Slow cook for 7-8 hours on low, or 4-5 hours on high."\r\n      },          {\r\n        "@type": "HowToStep",\r\n        "text": "Save cream cheese until the food is nearly done; let it melt on top prior to serving."\r\n      }      ]\r\n}\r\n']

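It contains a lot of \r\n sequences and extra spacing. How can I clean it up into a dictionary so that I can access keys such as carbohydrateContent and recipeIngredient?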

2 answers:

Answer 0 (score: 0)

Use ast.literal_eval.

For example:

import re
import ast

l = ['\r\n{\r\n  "@context": "https://schema.org/",\r\n  "@type": "Recipe",\r\n  "name": "The College Boy",\r\n  "url": "https://www.bodybuilding.com/recipes/the-college-boy",\r\n  "author": {\r\n    "@type": "Person",\r\n    "name": "Matt Biss"\r\n  },\r\n  "image": [\r\n    "https://www.bodybuilding.com/images/2018/august/crockpot-4b-header-960x540.jpg",\r\n            "https://www.bodybuilding.com/images/2018/august/crockpot-4b-square-600x600.jpg"\r\n      ],\r\n  "datePublished": "2018-08-27 00:00:00.0",\r\n  "publisher": {\r\n    "@type": "Organization",\r\n    "name": "Bodybuilding.com",\r\n    "logo": {\r\n      "@type": "ImageObject",\r\n      "url": "https://www.bodybuilding.com/images/icons/bb-logo-clean.png",\r\n      "width": 666,\r\n      "height": 422\r\n    }\r\n  },\r\n  "description": "I call this the "College Boy" because of its simple preparation. No chopping, dicing, slicing, or any real work is needed. You need only be able to use a can opener and get the top off the jar, and several hours later you will end up with some high-quality belly stuffing.",\r\n  "prepTime": "PT10M",\r\n  "cookTime": "PT420M",\r\n  "totalTime": "PT430M",\r\n  "recipeYield": "4 servings",\r\n  "recipeCuisine": "American",\r\n  "keywords": "Crockpot",\r\n  "nutrition": {\r\n    "@type": "NutritionInformation",\r\n            "calories": "607 calories",\r\n                "carbohydrateContent": "23 g",\r\n                "proteinContent": "70 g",\r\n                "fatContent": "26 g",\r\n        "servingSize": "4 servings"\r\n  },\r\n  "recipeIngredient": [\r\n                        "4 piece chicken breast",                    "1 16 oz can black beans, drained and rinsed",                    "1 15 oz can corn",                    "8 oz cream cheese"              ],\r\n  "recipeInstructions": [\r\n          {\r\n        "@type": "HowToStep",\r\n        "text": "Place chicken breasts in the Crock-Pot. 
They can still be frozen if that is your style."\r\n      },          {\r\n        "@type": "HowToStep",\r\n        "text": "Drain cans of black beans and corn and add them into the cauldron."\r\n      },          {\r\n        "@type": "HowToStep",\r\n        "text": "Top it with your salsa, stir it up, and let it go!"\r\n      },          {\r\n        "@type": "HowToStep",\r\n        "text": "Slow cook for 7-8 hours on low, or 4-5 hours on high."\r\n      },          {\r\n        "@type": "HowToStep",\r\n        "text": "Save cream cheese until the food is nearly done; let it melt on top prior to serving."\r\n      }      ]\r\n}\r\n']

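for i in l:
    # Re-quote each value: swap its outer double quotes for single quotes,
    # then let ast.literal_eval parse the result as a Python literal.
    print(ast.literal_eval(re.sub(r'(:\s*\"(.*)\")', r":'\2'", i)))
  • Note: I used a regular expression to replace the outer double quotes with single quotes, because you have some nested double quotes, for example: 'description': "I call this the "College Boy" because of its simple preparation. No chopping, dicing, slicing, or any real work is needed. You need only be able to use a can opener and get the top off the jar, and several hours later you will end up with some high-quality belly stuffing."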
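To show how the parsed result can then be used, here is a minimal sketch on a shortened, made-up sample rather than the full recipe. The helper name parse_recipe and the sample text are assumptions for illustration only. Note that this re-quoting trick would presumably break if a value itself contained a single quote, so treat it as a workaround for this particular page rather than a general parser.

```python
import ast
import re

# Shortened, made-up stand-in for the scraped script text (not the full recipe).
raw = (
    '\r\n{\r\n'
    '  "name": "The College Boy",\r\n'
    '  "description": "I call this the "College Boy" because it is simple.",\r\n'
    '  "nutrition": {\r\n'
    '    "carbohydrateContent": "23 g"\r\n'
    '  },\r\n'
    '  "recipeIngredient": [\r\n'
    '    "4 piece chicken breast",    "8 oz cream cheese"\r\n'
    '  ]\r\n'
    '}\r\n'
)


def parse_recipe(text):
    """Hypothetical helper: re-quote values, then parse as a Python literal."""
    # Normalize line endings and trim the surrounding blank lines (defensive).
    text = text.replace('\r\n', '\n').strip()
    # Same substitution as above: outer double quotes become single quotes.
    fixed = re.sub(r'(:\s*\"(.*)\")', r":'\2'", text)
    return ast.literal_eval(fixed)


recipe = parse_recipe(raw)
print(recipe['nutrition']['carbohydrateContent'])
print(recipe['recipeIngredient'])
```

Once the text is a dict, nested keys such as carbohydrateContent are reached through their parent key (nutrition), while recipeIngredient is an ordinary list.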

Answer 1 (score: 0)

Welcome to the community.

Use strip() when extracting the name/URL values from the HTML to get rid of unwanted characters.

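# output stands for whichever raw string was extracted (the name or the URL).
# Passing "\r\n" strips both characters from each end; strip("\r") alone would leave the "\n" behind.
name = output.strip("\r\n")
url = output.strip("\r\n")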

Then use them in your dict / JSON.
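As a rough sketch of that last step: when the script text is well-formed JSON, the standard json module can turn it into a dict directly, and it tolerates the surrounding \r\n and spacing on its own. The sample string and the helper name load_ld_json below are assumptions for illustration; with Beautiful Soup the text would typically come from the script tag (for example via its .string attribute). Text with unescaped inner quotes, like the description in the question, is not valid JSON and would be rejected.

```python
import json

# Made-up stand-in for the text of the application/ld+json script tag.
raw = (
    '\r\n{\r\n'
    '  "name": "The College Boy",\r\n'
    '  "nutrition": {\r\n'
    '    "carbohydrateContent": "23 g"\r\n'
    '  }\r\n'
    '}\r\n'
)


def load_ld_json(text):
    """Hypothetical helper: return the parsed dict, or None if the text is not valid JSON."""
    try:
        # json.loads skips leading and trailing whitespace, including \r and \n.
        return json.loads(text)
    except json.JSONDecodeError:
        return None


recipe = load_ld_json(raw)
print(recipe['nutrition']['carbohydrateContent'])

# Unescaped inner quotes make the text invalid JSON, so the helper returns None.
broken = load_ld_json('{"description": "the "College Boy" recipe"}')
print(broken)
```

Returning None on failure is just one design choice for the sketch; re-raising the error or falling back to a repair step would work equally well.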