I am using Beautiful Soup 4 to scrape the HTML page of a recipe, and the application/ld+json script contains the following:
['\r\n{\r\n "@context": "https://schema.org/",\r\n "@type": "Recipe",\r\n "name": "The College Boy",\r\n "url": "https://www.bodybuilding.com/recipes/the-college-boy",\r\n "author": {\r\n "@type": "Person",\r\n "name": "Matt Biss"\r\n },\r\n "image": [\r\n "https://www.bodybuilding.com/images/2018/august/crockpot-4b-header-960x540.jpg",\r\n "https://www.bodybuilding.com/images/2018/august/crockpot-4b-square-600x600.jpg"\r\n ],\r\n "datePublished": "2018-08-27 00:00:00.0",\r\n "publisher": {\r\n "@type": "Organization",\r\n "name": "Bodybuilding.com",\r\n "logo": {\r\n "@type": "ImageObject",\r\n "url": "https://www.bodybuilding.com/images/icons/bb-logo-clean.png",\r\n "width": 666,\r\n "height": 422\r\n }\r\n },\r\n "description": "I call this the "College Boy" because of its simple preparation. No chopping, dicing, slicing, or any real work is needed. You need only be able to use a can opener and get the top off the jar, and several hours later you will end up with some high-quality belly stuffing.",\r\n "prepTime": "PT10M",\r\n "cookTime": "PT420M",\r\n "totalTime": "PT430M",\r\n "recipeYield": "4 servings",\r\n "recipeCuisine": "American",\r\n "keywords": "Crockpot",\r\n "nutrition": {\r\n "@type": "NutritionInformation",\r\n "calories": "607 calories",\r\n "carbohydrateContent": "23 g",\r\n "proteinContent": "70 g",\r\n "fatContent": "26 g",\r\n "servingSize": "4 servings"\r\n },\r\n "recipeIngredient": [\r\n "4 piece chicken breast", "1 16 oz can black beans, drained and rinsed", "1 15 oz can corn", "8 oz cream cheese" ],\r\n "recipeInstructions": [\r\n {\r\n "@type": "HowToStep",\r\n "text": "Place chicken breasts in the Crock-Pot. 
They can still be frozen if that is your style."\r\n }, {\r\n "@type": "HowToStep",\r\n "text": "Drain cans of black beans and corn and add them into the cauldron."\r\n }, {\r\n "@type": "HowToStep",\r\n "text": "Top it with your salsa, stir it up, and let it go!"\r\n }, {\r\n "@type": "HowToStep",\r\n "text": "Slow cook for 7-8 hours on low, or 4-5 hours on high."\r\n }, {\r\n "@type": "HowToStep",\r\n "text": "Save cream cheese until the food is nearly done; let it melt on top prior to serving."\r\n } ]\r\n}\r\n']
It contains a lot of \r, \n, and extra spacing. How can I clean it up into a dictionary so that I can access keys such as carbohydrateContent or recipeIngredient?
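Answer 0 (score: 0)
Use ast.literal_eval. The description value contains unescaped double quotes, so the text is not valid JSON; re-quoting each value with single quotes first lets it be evaluated as a Python literal.
For example: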
import re
import ast
l = ['\r\n{\r\n "@context": "https://schema.org/",\r\n "@type": "Recipe",\r\n "name": "The College Boy",\r\n "url": "https://www.bodybuilding.com/recipes/the-college-boy",\r\n "author": {\r\n "@type": "Person",\r\n "name": "Matt Biss"\r\n },\r\n "image": [\r\n "https://www.bodybuilding.com/images/2018/august/crockpot-4b-header-960x540.jpg",\r\n "https://www.bodybuilding.com/images/2018/august/crockpot-4b-square-600x600.jpg"\r\n ],\r\n "datePublished": "2018-08-27 00:00:00.0",\r\n "publisher": {\r\n "@type": "Organization",\r\n "name": "Bodybuilding.com",\r\n "logo": {\r\n "@type": "ImageObject",\r\n "url": "https://www.bodybuilding.com/images/icons/bb-logo-clean.png",\r\n "width": 666,\r\n "height": 422\r\n }\r\n },\r\n "description": "I call this the "College Boy" because of its simple preparation. No chopping, dicing, slicing, or any real work is needed. You need only be able to use a can opener and get the top off the jar, and several hours later you will end up with some high-quality belly stuffing.",\r\n "prepTime": "PT10M",\r\n "cookTime": "PT420M",\r\n "totalTime": "PT430M",\r\n "recipeYield": "4 servings",\r\n "recipeCuisine": "American",\r\n "keywords": "Crockpot",\r\n "nutrition": {\r\n "@type": "NutritionInformation",\r\n "calories": "607 calories",\r\n "carbohydrateContent": "23 g",\r\n "proteinContent": "70 g",\r\n "fatContent": "26 g",\r\n "servingSize": "4 servings"\r\n },\r\n "recipeIngredient": [\r\n "4 piece chicken breast", "1 16 oz can black beans, drained and rinsed", "1 15 oz can corn", "8 oz cream cheese" ],\r\n "recipeInstructions": [\r\n {\r\n "@type": "HowToStep",\r\n "text": "Place chicken breasts in the Crock-Pot. They can still be frozen if that is your style."\r\n }, {\r\n "@type": "HowToStep",\r\n "text": "Drain cans of black beans and corn and add them into the cauldron."\r\n }, {\r\n "@type": "HowToStep",\r\n "text": "Top it with your salsa, stir it up, and let it go!"\r\n }, {\r\n "@type": "HowToStep",\r\n "text": "Slow cook for 7-8 hours on low, or 4-5 hours on high."\r\n }, {\r\n "@type": "HowToStep",\r\n "text": "Save cream cheese until the food is nearly done; let it melt on top prior to serving."\r\n } ]\r\n}\r\n']
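for i in l:
    # Re-quote every `: "value"` with single quotes so that inner double
    # quotes (as in the description) no longer end the string early.
    print(ast.literal_eval(re.sub(r'(:\s*\"(.*)\")', r":'\2'", i)))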
'description': "I call this the "College Boy" because of its simple preparation. No chopping, dicing, slicing, or any real work is needed. You need only be able to use a can opener and get the top off the jar, and several hours later you will end up with some high-quality belly stuffing."
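The same idea can be sketched on a shortened sample to show how the keys from the question are reached once the text has been parsed. The sample string and the parse_recipe helper below are made up for illustration and are not part of the original answer. One limitation to keep in mind: because values are re-quoted with single quotes, this approach would likely fail if a value itself contained an apostrophe, or if the data used JSON-only literals such as true, false, or null.

```python
import ast
import re

# Shortened, hypothetical sample in the same shape as the scraped data,
# including the unescaped inner quotes in "description".
raw = (
    '\r\n{\r\n'
    '  "name": "The College Boy",\r\n'
    '  "description": "I call this the "College Boy" because it is simple.",\r\n'
    '  "nutrition": {\r\n'
    '    "carbohydrateContent": "23 g"\r\n'
    '  },\r\n'
    '  "recipeIngredient": [\r\n'
    '    "4 piece chicken breast", "8 oz cream cheese" ]\r\n'
    '}\r\n'
)


def parse_recipe(text):
    """Re-quote `: "value"` pairs with single quotes, then evaluate."""
    # `.` does not match \n, so each substitution stays on one line and the
    # greedy `.*` runs up to the last double quote on that line.
    fixed = re.sub(r':\s*"(.*)"', r": '\1'", text)
    # Strip the surrounding \r\n before handing the text to literal_eval.
    return ast.literal_eval(fixed.strip())


recipe = parse_recipe(raw)
print(recipe["nutrition"]["carbohydrateContent"])
print(recipe["recipeIngredient"])
```

Note that carbohydrateContent is nested under the nutrition key, while recipeIngredient sits at the top level of the dictionary.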
Answer 1 (score: 0)
Welcome to the community.
When extracting values such as the name or URL from the HTML, use strip() to remove the unwanted characters.
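# `output` stands for the raw text extracted from the HTML.
# strip() only removes the listed characters from both ends of the string.
name = output.strip("\r\n ")
url = output.strip("\r\n ")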
Then use them in a dict / JSON.
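As a rough sketch of how stripping and then building a dictionary might fit together, the example below uses a small, hypothetical snippet in place of the script tag's text. It assumes the text is valid JSON; the data in the question is not, because of the unescaped quotes in description, so json.loads would raise an error on it without a repair step first.

```python
import json

# Hypothetical, well-formed snippet standing in for the script tag's text.
raw = (
    '\r\n{\r\n'
    '  "name": "The College Boy",\r\n'
    '  "url": "https://www.bodybuilding.com/recipes/the-college-boy"\r\n'
    '}\r\n'
)

# With no arguments, strip() removes leading and trailing whitespace,
# which covers \r, \n, and spaces.
cleaned = raw.strip()

# Whitespace between JSON tokens is ignored by the parser.
data = json.loads(cleaned)

name = data["name"].strip()
url = data["url"].strip()
result = {"name": name, "url": url}
print(result)
```

The \r\n sequences between tokens do not need to be removed one by one, since the JSON parser treats them as insignificant whitespace.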