Question

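I have a very large JSON file that does not fit in memory. It contains many "results" objects, each of which carries address information. In Python 3.6, I am trying to do the following: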

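1. Open the large file and read only the first "results" object into memory.
2. Flatten the fields of interest and filter them into a Python list.
3. Scan the list to see whether the data is a keeper. If it is, append the data as one row to a CSV file on disk, then loop back to the top to parse the next "results" object.
4. If it is not, flush the dictionary and loop back to the top.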
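
The steps above can be sketched roughly as follows. This is only a minimal sketch under a few assumptions: `is_keeper` is a hypothetical placeholder for whatever "keeper" test is actually needed, `flatten_result`, `write_keepers` and `stream_results` are names made up for illustration, and the file names are examples. It relies on `ijson.items(fi, 'results.item')`, which yields one fully built "results" object at a time instead of low-level events.

```python
import csv

WANTED_KEYS = ['contact', 'applicant', 'address_1', 'address_2', 'city',
               'state', 'postal_code', 'country_code', 'regulation_number']


def flatten_result(result):
    """Flatten one "results" object into a list ordered like WANTED_KEYS."""
    nested = result.get('openfda') or {}
    # Look at the top level first, then inside "openfda"; default to ''.
    return [result.get(key, nested.get(key, '')) for key in WANTED_KEYS]


def is_keeper(row):
    """Hypothetical filter (an assumption): keep rows with a non-empty contact."""
    return bool(row[0].strip())


def write_keepers(results, out_file):
    """Write one CSV row per keeper and return how many rows were written."""
    writer = csv.writer(out_file)
    kept = 0
    for result in results:  # only one result object is alive at a time
        row = flatten_result(result)
        if is_keeper(row):
            writer.writerow(row)
            kept += 1
    return kept


def stream_results(path):
    """Yield "results" objects one by one (needs the third-party ijson package)."""
    import ijson
    with open(path, 'rb') as fi:
        for result in ijson.items(fi, 'results.item'):
            yield result


# Possible usage, assuming bigfile.json exists:
#     with open('keepers.csv', 'w', newline='') as fo:
#         write_keepers(stream_results('bigfile.json'), fo)
```

Because each result arrives as an ordinary dict, the varying field order inside the file does not matter, and there is nothing to flush between iterations.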

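I have been trying to parse the file with ijson, which sort of works, but not the way I expected. I have been beating on parts 1 and 2 for several days and could use some pointers on how to improve this.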

Here is a simplified example of the large input file:

{
  "meta": {
    "last_updated": "2018-08-04",
    "results": {
      "skip": 0,
      "limit": 1,
      "total": 545
    }
  },
  "results": [{
      "city": "SAN DIEGO",
      "address_1": "9001 SPECTRUM CENTER BLVD.",
      "address_2": "",
      "openfda": {
        "device_name": "Ventilator, Continuous, Non-Life-Supporting",
        "regulation_number": "868.5895"
      },
      "zip_code": "92123",
      "applicant": "RESMED LTD.",
      "decision_code": "SESE",
      "decision_date": "2012-03-29",
      "country_code": "US",
      "device_name": "VPAP ST-A",
      "contact": "DAVID  D'CRUZ",
      "state": "CA",
      "k_number": "K113288",
      "postal_code": "92123"
    },
    {
      "city": "McHenry",
      "address_1": "803 N. Front St. Suite 3",
      "address_2": "",
      "openfda": {
        "device_name": "Prosthesis, Knee, Hemi-, Patellar Resurfacing, Uncemented",
        "regulation_number": "888.3580"
      },
      "zip_code": "60050",
      "applicant": "WALDEMAR LINK GMBH & CO. KG",
      "decision_date": "1980-04-21",
      "country_code": "US",
      "device_name": "PATELLAR COMPONENT FOR LUBINUS PATELLAR",
      "contact": "",
      "state": "IL",
      "k_number": "K800800",
      "postal_code": "60050"
    },
    {
      "applicant": "QUEST INTL., INC.",
      "postal_code": "33181",
      "country_code": "US",
      "decision_date": "2003-01-06",
      "product_code": "DDC",
      "city": "NORTH MIAMI",
      "openfda": {
        "regulation_number": "866.5870",
        "device_name": "Thyroglobulin, Antigen, Antiserum, Control"
      },
      "state": "FL",
      "address_1": "1938 N.E. 148TH TERR.",
      "device_name": "SERAQUEST ANTI-THYROGLOBULIN",
      "contact": "ROBERT A CORT",
      "k_number": "K023592",
      "address_2": "",
      "zip_code": "33181"
    }
  ]
}

You may notice that the order of the fields varies from one "results" object to the next.

I would like to find a more elegant way of selecting the fields of interest than the code below:

import ijson
import csv 

wanted_keys = ['contact', 'applicant', 'address_1', 'address_2', 'city', 'state', 'postal_code', 'country_code', 'regulation_number']

with open('bigfile.json', 'rb') as fi:
    parser = ijson.parse(fi)
    for prefix, event, value in parser:
        if prefix == 'results.item.contact':
            contact = value
            print("contact is: {}".format(contact))
        elif prefix == 'results.item.applicant':
            applicant = value
            print("applicant is: {}".format(applicant))
        elif prefix == 'results.item.address_1':
            address_1 = value
        elif prefix == 'results.item.address_2':
            address_2 = value
        elif prefix == 'results.item.city':
            city = value
        elif prefix == 'results.item.state':
            state = value
        elif prefix == 'results.item.postal_code':
            postal_code = value
        elif prefix == 'results.item.country_code':
            country_code = value
        elif prefix == 'results.item.openfda.regulation_number':
            regulation_number = value

This outputs:

applicant is: RESMED LTD.
contact is: DAVID  D'CRUZ
applicant is: WALDEMAR LINK GMBH & CO. KG
contact is:
applicant is: QUEST INTL., INC.
contact is: ROBERT A CORT
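
As for a more compact way of selecting fields than a long `elif` chain, one option is a lookup table from ijson prefix to field name. The sketch below is only an illustration: `PREFIX_TO_KEY` and `store_field` are hypothetical names, and it assumes `regulation_number` is the only wanted field nested under `openfda`, as in the sample data.

```python
TOP_LEVEL_KEYS = ['contact', 'applicant', 'address_1', 'address_2', 'city',
                  'state', 'postal_code', 'country_code']

# Map each ijson prefix to the field it fills.
PREFIX_TO_KEY = {'results.item.' + key: key for key in TOP_LEVEL_KEYS}
# regulation_number lives one level deeper, inside "openfda".
PREFIX_TO_KEY['results.item.openfda.regulation_number'] = 'regulation_number'


def store_field(record, prefix, value):
    """Store value in record if prefix is a wanted field.

    Returns True when the value was stored, False when it was ignored.
    """
    key = PREFIX_TO_KEY.get(prefix)
    if key is None:
        return False
    record[key] = value
    return True
```

Inside the `for prefix, event, value in parser:` loop, a single `store_field(record, prefix, value)` call would then stand in for the whole chain of comparisons.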

When I try to build a list and print it at the bottom of the for loop,

contact_info = [contact, applicant, address_1, address_2, city, state, postal_code, country_code, regulation_number,]
print(contact_info)

the code fails with this error:

contact_info = [contact, applicant, address_1, address_2, city, state, postal_code, regulation_number]
NameError: name 'contact' is not defined

It had printed the contact and applicant information to the console, but by the end of the for loop contact is gone?
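
One likely explanation, assuming the list is built inside the loop body: `ijson.parse` emits events for the top-level object and for `meta` before any `results.item.contact` event arrives, so on the very first iteration `contact` has never been assigned. A possible way around this, sketched below, is to collect values into a dict and build the list only when the `end_map` event for `results.item` is seen. The function name `iter_contact_info` and the empty-string default for missing fields are assumptions made for illustration.

```python
WANTED_PREFIXES = [
    'results.item.contact',
    'results.item.applicant',
    'results.item.address_1',
    'results.item.address_2',
    'results.item.city',
    'results.item.state',
    'results.item.postal_code',
    'results.item.country_code',
    'results.item.openfda.regulation_number',
]


def iter_contact_info(events):
    """Yield one contact_info list per "results" object.

    `events` is any iterable of (prefix, event, value) triples, such as
    the one returned by ijson.parse(fi).
    """
    record = {}
    for prefix, event, value in events:
        if prefix == 'results.item' and event == 'start_map':
            record = {}  # a new result starts: forget the previous one
        elif prefix == 'results.item' and event == 'end_map':
            # The result is complete; missing fields become ''.
            yield [record.get(p, '') for p in WANTED_PREFIXES]
        elif prefix in WANTED_PREFIXES:
            record[prefix] = value
```

With the real file this could be driven as `for contact_info in iter_contact_info(ijson.parse(fi)): ...`, so the list is only ever built from a finished record and never from names that have not been assigned yet.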

Parsing an ijson stream object and filtering into a list

0 answers: