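I have a very large JSON file that does not fit in memory. It contains many "results" objects, each of which includes address information. I am trying to process it in Python 3.6.
I have been parsing it with ijson, which works, but not the way I expected. I have been banging my head against parts 1 and 2 for several days and could use some hints on how to improve.
Here is a simplified example of the large input file: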
{
  "meta": {
    "last_updated": "2018-08-04",
    "results": {
      "skip": 0,
      "limit": 1,
      "total": 545
    }
  },
  "results": [{
      "city": "SAN DIEGO",
      "address_1": "9001 SPECTRUM CENTER BLVD.",
      "address_2": "",
      "openfda": {
        "device_name": "Ventilator, Continuous, Non-Life-Supporting",
        "regulation_number": "868.5895"
      },
      "zip_code": "92123",
      "applicant": "RESMED LTD.",
      "decision_code": "SESE",
      "decision_date": "2012-03-29",
      "country_code": "US",
      "device_name": "VPAP ST-A",
      "contact": "DAVID D'CRUZ",
      "state": "CA",
      "k_number": "K113288",
      "postal_code": "92123"
    },
    {
      "city": "McHenry",
      "address_1": "803 N. Front St. Suite 3",
      "address_2": "",
      "openfda": {
        "device_name": "Prosthesis, Knee, Hemi-, Patellar Resurfacing, Uncemented",
        "regulation_number": "888.3580"
      },
      "zip_code": "60050",
      "applicant": "WALDEMAR LINK GMBH & CO. KG",
      "decision_date": "1980-04-21",
      "country_code": "US",
      "device_name": "PATELLAR COMPONENT FOR LUBINUS PATELLAR",
      "contact": "",
      "state": "IL",
      "k_number": "K800800",
      "postal_code": "60050"
    },
    {
      "applicant": "QUEST INTL., INC.",
      "postal_code": "33181",
      "country_code": "US",
      "decision_date": "2003-01-06",
      "product_code": "DDC",
      "city": "NORTH MIAMI",
      "openfda": {
        "regulation_number": "866.5870",
        "device_name": "Thyroglobulin, Antigen, Antiserum, Control"
      },
      "state": "FL",
      "address_1": "1938 N.E. 148TH TERR.",
      "device_name": "SERAQUEST ANTI-THYROGLOBULIN",
      "contact": "ROBERT A CORT",
      "k_number": "K023592",
      "address_2": "",
      "zip_code": "33181"
    }
  ]
}
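As you may notice, the order of the fields varies from one "result" to the next.
I would like to find a more elegant way of selecting the fields of interest than the code below: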
import ijson
import csv

# Fields of interest (not yet used by the loop below)
wanted_keys = ['contact', 'applicant', 'address_1', 'address_2', 'city', 'state', 'postal_code', 'country_code', 'regulation_number']

with open('bigfile.json', 'rb') as fi:
    # ijson.parse yields (prefix, event, value) tuples as it streams the file
    parser = ijson.parse(fi)
    for prefix, event, value in parser:
        if prefix == 'results.item.contact':
            contact = value
            print("contact is: {}".format(contact))
        elif prefix == 'results.item.applicant':
            applicant = value
            print("applicant is: {}".format(applicant))
        elif prefix == 'results.item.address_1':
            address_1 = value
        elif prefix == 'results.item.address_2':
            address_2 = value
        elif prefix == 'results.item.city':
            city = value
        elif prefix == 'results.item.state':
            state = value
        elif prefix == 'results.item.postal_code':
            postal_code = value
        elif prefix == 'results.item.country_code':
            country_code = value
        # regulation_number is nested inside the "openfda" object
        elif prefix == 'results.item.openfda.regulation_number':
            regulation_number = value
This outputs:
applicant is: RESMED LTD.
contact is: DAVID D'CRUZ
applicant is: WALDEMAR LINK GMBH & CO. KG
contact is: 
applicant is: QUEST INTL., INC.
contact is: ROBERT A CORT
When I try to build a list and print it at the bottom of the for loop,
        # placed at the end of the for loop body
        contact_info = [contact, applicant, address_1, address_2, city, state, postal_code, country_code, regulation_number]
        print(contact_info)
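the code fails with this error:
contact_info = [contact, applicant, address_1, address_2, city, state, postal_code, regulation_number]
NameError: name 'contact' is not defined
The contact and applicant values were printed to the console, so why is contact missing by the time the bottom of the for loop is reached?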
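One likely reading of the error: the list is built at the bottom of every iteration, including the very first events (the opening of the top-level object and the "meta" block), when none of the branches has assigned contact yet. The prints appear to work only because they sit inside the branches that do the assigning. A minimal sketch of a less repetitive approach follows, assuming the event stream has the (prefix, event, value) shape that ijson.parse produces. The names prefix_for, PREFIX_TO_COLUMN, collect_rows and write_csv are hypothetical helpers introduced for illustration, not part of ijson.

```python
import csv

# Same fields as wanted_keys in the question.
WANTED_KEYS = ['contact', 'applicant', 'address_1', 'address_2', 'city',
               'state', 'postal_code', 'country_code', 'regulation_number']


def prefix_for(key):
    """Return the ijson prefix under which a wanted key appears (hypothetical helper)."""
    if key == 'regulation_number':
        # This field lives inside the nested "openfda" object.
        return 'results.item.openfda.regulation_number'
    return 'results.item.' + key


# Lookup table that replaces the long if/elif chain.
PREFIX_TO_COLUMN = {prefix_for(key): key for key in WANTED_KEYS}


def collect_rows(events):
    """Yield one dict per "results" object from (prefix, event, value) tuples."""
    row = None
    for prefix, event, value in events:
        if prefix == 'results.item' and event == 'start_map':
            # A new result begins: start with every column empty.
            row = dict.fromkeys(WANTED_KEYS, '')
        elif prefix == 'results.item' and event == 'end_map':
            # The result is complete, whatever order its fields arrived in.
            yield row
            row = None
        elif row is not None and prefix in PREFIX_TO_COLUMN:
            row[PREFIX_TO_COLUMN[prefix]] = value


def write_csv(json_path, csv_path):
    """Stream json_path through ijson and write one CSV line per result."""
    import ijson  # third-party; imported here so collect_rows needs only the stdlib
    with open(json_path, 'rb') as fi, open(csv_path, 'w', newline='') as fo:
        writer = csv.DictWriter(fo, fieldnames=WANTED_KEYS)
        writer.writeheader()
        writer.writerows(collect_rows(ijson.parse(fi)))
```

Because a row is emitted only on the end_map event of each result, the order of the fields inside a result no longer matters, and a field that is absent simply stays an empty string. If each individual result comfortably fits in memory, ijson.items(fi, 'results.item') may be simpler still, since it yields every result as an ordinary dict.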