Extracting specific text from a web page

Time: 2020-09-11 01:22:41

Tags: python web-scraping text-extraction

How can I extract

  • the county name
  • the answers for confirmed cases and deaths?

For example: {'Alabama':{'Autauga County':{'confirmed cases':'1522','death':'24'},'Baldwin County':{'confirmed cases':'4787','death':'46'}}} and so on.

Page link: https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/state/alabama

I was able to scrape the page and save it to a file.

Thank you very much!

Text 1

{"@type":"ImageObject","url":"https://static1.squarespace.com/static/5a1340ef914e6bf3c0764c0c/t/5dbb4885de62ca18968da164/1582662729455/?format=1500w"},"founder":"Steve Ballmer","legalName":"Ballmer Giving LLC"}},"mainEntityOfPage":"/visualizations/coronavirus-covid-19-spread-map"}</script><script data-react-helmet="true" type="application/ld+json">{"@context":"https://schema.org","@type":"FAQPage","mainEntity":[{"@type":"Question","name":"How many COVID-19 cases in Autauga County, Alabama?","acceptedAnswer":{"@type":"Answer","text":1522}},{"@type":"Question","name":"How many COVID-19 cases in Baldwin County, Alabama?","acceptedAnswer":{"@type":"Answer","text":4787}},

Text 2

{"@type":"Question","name":"How many COVID-19 deaths in Autauga County, Alabama?","acceptedAnswer":{"@type":"Answer","text":24}},{"@type":"Question","name":"How many COVID-19 deaths in Baldwin County, Alabama?","acceptedAnswer":{"@type":"Answer","text":46}},{"@type":"Question","name":"How many COVID-19 deaths in Barbour County, Alabama?","acceptedAnswer":{"@type":"Answer","text":7}}

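1 answer:

Answer 0: (score: 0)

The text you extracted (Text 2) looks like a comma-separated sequence of JSON objects, so you can wrap it in square brackets to make it a valid JSON array and load it with the json library. The result is a list of dicts, one per question.

Something like this?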

import json

text2 = '''{"@type":"Question","name":"How many COVID-19 deaths in Autauga County, Alabama?","acceptedAnswer":{"@type":"Answer","text":24}},{"@type":"Question","name":"How many COVID-19 deaths in Baldwin County, Alabama?","acceptedAnswer":{"@type":"Answer","text":46}},{"@type":"Question","name":"How many COVID-19 deaths in Barbour County, Alabama?","acceptedAnswer":{"@type":"Answer","text":7}} '''
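# Wrapping the comma-separated objects in [ ] turns them into a valid JSON array
data = json.loads('[' + text2 + ']')
for item in data:
    print(item['name'])                    # the question, which contains county and state
    print(item['acceptedAnswer']['text'])  # the number (already an int after parsing)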
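
Going one step further, here is a rough sketch of how the parsed questions could be turned into the nested dict from the question. It works on the saved HTML instead of a hand-cut fragment, since Text 1 as pasted starts in the middle of another object and would not parse on its own. Assumptions on my side: the numbers live in `<script type="application/ld+json">` blocks of type FAQPage, as Text 1 suggests, and every question name follows the pattern "How many COVID-19 cases|deaths in <County>, <State>?". The names `faq_items`, `build_table` and `sample` are made up for this example, and `sample` is a shortened stand-in for the saved page.

```python
import json
import re

# Assumed pattern, taken from the question names shown in Text 1 and Text 2.
QUESTION_RE = re.compile(r"How many COVID-19 (cases|deaths) in (.+), (.+)\?")

# Output keys follow the layout requested in the question.
KEY_FOR = {"cases": "confirmed cases", "deaths": "death"}

# Captures the body of each <script type="application/ld+json"> block.
LD_JSON_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def faq_items(html):
    """Yield every Question object found in FAQPage JSON-LD blocks."""
    for raw in LD_JSON_RE.findall(html):
        try:
            block = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip script bodies that are not valid JSON
        if isinstance(block, dict) and block.get("@type") == "FAQPage":
            yield from block.get("mainEntity", [])


def build_table(items):
    """Return {state: {county: {'confirmed cases': '...', 'death': '...'}}}."""
    table = {}
    for item in items:
        match = QUESTION_RE.fullmatch(item.get("name", ""))
        if match is None:
            continue  # not a cases/deaths question
        kind, county, state = match.groups()
        counts = table.setdefault(state, {}).setdefault(county, {})
        # The question's example stores the numbers as strings, so convert.
        counts[KEY_FOR[kind]] = str(item["acceptedAnswer"]["text"])
    return table


# Shortened stand-in for the saved page (hypothetical, modeled on Text 1).
sample = (
    '<script data-react-helmet="true" type="application/ld+json">'
    '{"@context":"https://schema.org","@type":"FAQPage","mainEntity":['
    '{"@type":"Question","name":"How many COVID-19 cases in Autauga County, Alabama?",'
    '"acceptedAnswer":{"@type":"Answer","text":1522}},'
    '{"@type":"Question","name":"How many COVID-19 deaths in Autauga County, Alabama?",'
    '"acceptedAnswer":{"@type":"Answer","text":24}}]}'
    '</script>'
)
print(build_table(faq_items(sample)))
# {'Alabama': {'Autauga County': {'confirmed cases': '1522', 'death': '24'}}}
```

For the real page you would read the saved file into a string and pass that to `faq_items` instead of `sample`. If the markup differs from Text 1 (for instance different attribute quoting on the script tag), the regular expression would need adjusting, or an HTML parser could be used to pull out the script bodies.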