How do I parse only the keys I'm interested in?

Time: 2019-10-13 21:12:30

Tags: python json csv web-scraping

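I've managed to fetch some JSON that I now want to export to a CSV file. With my code in its current state, however, the resulting CSV ends up with roughly one dictionary per cell. What I want instead is one column per key I'm interested in, holding that key's value. Each JSON has a lot of information I'm not actually interested in; I only want keys such as cadId, cadNomeCompleto, cadProfissao and habDes. Some of these are nested inside other categories of each JSON, for example habDes inside pt_ar_wsgode_objectos_DadosHabilitacoes, inside cadHabilitacoes, inside RegistoBiograficoList.

I've searched some JSON documentation to see whether there is a function that takes keys as input in the way I need. So far I haven't been able to parse only the keys I want and export them, for example to create uniform columns in a CSV file. Can someone explain what I'm doing wrong and tell me how to do this?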

import json
import csv
from csv import DictWriter

import requests  # needed for requests.get() below


list_json = ['a705932387657456c4a535355794d45786c5a326c7a6247463064584a684c314a6c5a326c7a644739436157396e636d466d61574e7657456c4a53563971633239754c6e523464413d3d&fich=RegistoBiograficoXIII_json.txt&Inline=true',
             'a705932387657456c4a4a5449775447566e61584e7359585231636d4576556d566e61584e3062304a706232647959575a705932395953556c66616e4e76626935306548513d&fich=RegistoBiograficoXII_json.txt&Inline=true',
             'a705932387657456b6c4d6a424d5a5764706332786864485679595339535a57647063335276516d6c765a334a685a6d6c6a6231684a5832707a6232347564486830&fich=RegistoBiograficoXI_json.txt&Inline=true',
             'a7059323876574355794d45786c5a326c7a6247463064584a684c314a6c5a326c7a644739436157396e636d466d61574e7657463971633239754c6e523464413d3d&fich=RegistoBiograficoX_json.txt&Inline=true',
             'a7059323876566b6c4a535355794d45786c5a326c7a6247463064584a684c314a6c5a326c7a644739436157396e636d466d61574e76566b6c4a53563971633239754c6e523464413d3d&fich=RegistoBiograficoVIII_json.txt&Inline=true',
             'a7059323876566b6c4a535355794d45786c5a326c7a6247463064584a684c314a6c5a326c7a644739436157396e636d466d61574e76566b6c4a53563971633239754c6e523464413d3d&fich=RegistoBiograficoVIII_json.txt&Inline=true',
             'a7059323876566b6c4a4a5449775447566e61584e7359585231636d4576556d566e61584e3062304a706232647959575a705932395753556c66616e4e76626935306548513d&fich=RegistoBiograficoVII_json.txt&Inline=true',
             'a7059323876566b6b6c4d6a424d5a5764706332786864485679595339535a57647063335276516d6c765a334a685a6d6c6a62315a4a5832707a6232347564486830&fich=RegistoBiograficoVI_json.txt&Inline=true',
             'a7059323876566955794d45786c5a326c7a6247463064584a684c314a6c5a326c7a644739436157396e636d466d61574e76566c3971633239754c6e523464413d3d&fich=RegistoBiograficoV_json.txt&Inline=true',
             'a70593238765356596c4d6a424d5a5764706332786864485679595339535a57647063335276516d6c765a334a685a6d6c6a62306c575832707a6232347564486830&fich=RegistoBiograficoIV_json.txt&Inline=true',
             'a705932387653556c4a4a5449775447566e61584e7359585231636d4576556d566e61584e3062304a706232647959575a705932394a53556c66616e4e76626935306548513d&fich=RegistoBiograficoIII_json.txt&Inline=true',
             'a705932387653556b6c4d6a424d5a5764706332786864485679595339535a57647063335276516d6c765a334a685a6d6c6a62306c4a5832707a6232347564486830&fich=RegistoBiograficoII_json.txt&Inline=true',
             'a7059323876513239756333527064485670626e526c4c314a6c5a326c7a644739436157396e636d466d61574e765132397563313971633239754c6e523464413d3d&fich=RegistoBiograficoCons_json.txt&Inline=true']


result = []

for i in list_json:
    url = 'http://app.parlamento.pt/webutils/docs/doc.txt?path=6148523063446f764c324679626d56304c3239775a57356b595852684c3052685a47397a51574a6c636e5276637939535a576470633352764a544977516d6c765a334c446f575{}'.format(i)
    r = requests.get(url)
    cont = r.json()
    result.append(cont)


with open('bio.csv', 'w', newline='', encoding='utf-8-sig') as outfile:
    writer = DictWriter(outfile, ('?xml', 'RegistoBiografico'))
    writer.writerows(result)

1 answer:

Answer 0 (score: 0)

You can iterate over the child entries to extract the data, for example:

result = []
for child in cont['RegistoBiografico']['RegistoBiograficoList']['pt_ar_wsgode_objectos_DadosRegistoBiograficoWeb']:
    tmp_row = []
    # iterate through keys in which we're interested
    for k in ['cadId', 'cadNomeCompleto', 'cadProfissao']:
        try:
            tmp_row.append(child[k])
        except KeyError:
            print(f"  missing {k} for {child['cadId']}")
            # insert None for missing value so columns still match
            tmp_row.append(None)
    result.append(tmp_row)

Running this shows that some entries don't have all of the data:

  missing cadProfissao for 5950
  missing cadProfissao for 6063
  missing cadProfissao for 6121
  missing cadProfissao for 5534
  missing cadProfissao for 695
  missing cadProfissao for 5952
  missing cadProfissao for 4104
  missing cadProfissao for 4389
  missing cadProfissao for 2445
>>> result[123]
['5854', 'ISABEL CRISTINA RUA PIRES', 'operadora de call cen´ter']
>>>

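To add a nested key, you could insert tmp_row.append(child['a']['b']['c']), but then you would also need to repeat the handling of missing values.

With the jsonpointer module, you can specify the path to the value you want to access: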
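
One way to avoid repeating the missing-value handling for every nested key is to wrap the lookup in a small helper. The sketch below uses only the standard library. Note that `get_nested` and the sample record are hypothetical, made up for illustration; they are not part of the original code or taken from the real JSON.

```python
def get_nested(data, path, default=None):
    """Follow a sequence of keys through nested containers.

    Returns `default` as soon as any step of the path cannot be resolved.
    """
    current = data
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            # KeyError: key absent from a dict
            # IndexError: list index out of range
            # TypeError: value cannot be indexed with this key
            #            (e.g. None, or a str/list indexed by a string)
            return default
    return current


# Hypothetical record, shaped like the nesting described in the question
sample = {
    'cadId': '1',
    'cadHabilitacoes': {
        'pt_ar_wsgode_objectos_DadosHabilitacoes': {'habDes': 'Direito'},
    },
}

hab_path = ['cadHabilitacoes', 'pt_ar_wsgode_objectos_DadosHabilitacoes', 'habDes']
found = get_nested(sample, hab_path)              # value at the end of the path
missing = get_nested({'cadId': '2'}, hab_path)    # path absent, so the default
```

Inside the loop, each nested column then becomes a single call such as tmp_row.append(get_nested(child, hab_path)), with no extra try/except.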

from jsonpointer import resolve_pointer as j_get
result = []
search_dict = {
  'Id': '/cadId',
  'NomeCompleto': '/cadNomeCompleto',
  'Profissao':'/cadProfissao',
  'habDes':'/cadHabilitacoes/pt_ar_wsgode_objectos_DadosHabilitacoes/habDes',
}

for child in cont['RegistoBiografico']['RegistoBiograficoList']['pt_ar_wsgode_objectos_DadosRegistoBiograficoWeb']:
    tmp_row = []
    # iterate through keys in which we're interested
    for k in search_dict.keys():
        tmp_row.append(j_get(child, search_dict[k], None))
    result.append(tmp_row)

Since I pass a default value of None to the resolve_pointer function, I removed the KeyError exception handling. The result now contains:

>>> result[123]
['5854', 'ISABEL CRISTINA RUA PIRES', 'operadora de call cen´ter', 'Ciência Política']

If you're interested in which rows are incomplete, or how many there are, you can use a list comprehension:

>>> len([x for x in result if None in x])
165

However, this is easier to see in the CSV output.
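
Since the original goal was a CSV with one column per key, here is a minimal sketch of writing the rows out with a header line. It assumes each row lists its values in the same order as the header names. The `rows_to_csv` helper, the `header` list and the second sample row are hypothetical and only there for illustration. `csv.writer` writes None as an empty field, so incomplete rows appear as blank cells.

```python
import csv
import io


def rows_to_csv(outfile, header, rows):
    """Write a header line followed by one CSV line per row."""
    writer = csv.writer(outfile)
    writer.writerow(header)
    writer.writerows(rows)


header = ['Id', 'NomeCompleto', 'Profissao', 'habDes']
rows = [
    ['5854', 'ISABEL CRISTINA RUA PIRES', 'operadora de call cen´ter', 'Ciência Política'],
    ['0000', 'EXAMPLE NAME', None, None],  # made-up incomplete row
]

# An in-memory buffer keeps the example self-contained
buffer = io.StringIO()
rows_to_csv(buffer, header, rows)
csv_text = buffer.getvalue()
```

To write to disk instead, pass a file opened the same way as in the question, for example open('bio.csv', 'w', newline='', encoding='utf-8-sig'), in place of the buffer.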