Question

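I am reading a large amount of JSON from a web page. Reading it all at once fails with a memory error, so I am trying to stream it with the ijson library. The problem is that I lose the structure of the JSON, so I cannot retrieve the data correctly and in order.

I can read the key-value pairs and so on, but there is no structure any more.

Here is my code: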

import urllib.request
import ijson

parser = ijson.parse(urllib.request.urlopen('https://data.medicaid.gov/resource/4qik-skk9.json?$limit=5'))
for prefix, event, value in parser:
    print(str(prefix) + " " + str(event) + " " + str(value))

Here is the structure of the JSON:

[
    {
        "package_size_code": "60",
        "fda_ther_equiv_code": "NR",
        "fda_application_number": "204153",
        "clotting_factor_indicator": "N",
        "year": "2018",
        "fda_product_name": "LUZU Cream 1% 60gm",
        "labeler_name": "MEDICIS DERMATOLOGICS, INC.",
        "ndc": "99207085060",
        "product_code": "0850",
        "unit_type": "GM",
        "fda_approval_date": "2013-11-14T00:00:00",
        "market_date": "2014-03-14T00:00:00",
        "pediatric_indicator": "N",
        "package_size_intro_date": "2014-03-14T00:00:00",
        "units_per_pkg_size": "60000",
        "labeler_code": "99207",
        "desi_indicator": "1",
        "drug_category": "S",
        "quarter": "3",
        "cod_status": "3"
    },
    {
        "package_size_code": "60",
        "fda_ther_equiv_code": "AB",
        "fda_application_number": "21758",
        "clotting_factor_indicator": "N",
        "year": "2018",
        "fda_product_name": "VANOS CREAM .1%",
        "labeler_name": "MEDICIS DERMATOLOGICS, INC.",
        "ndc": "99207052560",
        "product_code": "0525",
        "unit_type": "GM",
        "fda_approval_date": "2005-02-11T00:00:00",
        "market_date": "2005-02-21T00:00:00",
        "pediatric_indicator": "N",
        "package_size_intro_date": "2005-02-21T00:00:00",
        "units_per_pkg_size": "60000",
        "labeler_code": "99207",
        "desi_indicator": "1",
        "drug_category": "I",
        "quarter": "3",
        "cod_status": "3"
    },
.
.
.
.
]

The output I get from ijson is:

 start_array None                                           
item start_map None                                         
item map_key clotting_factor_indicator                      
item.clotting_factor_indicator string N                     
item map_key cod_status                                     
item.cod_status string 4                                    
item map_key desi_indicator                                 
item.desi_indicator string 1                                
item map_key drug_category                                  
item.drug_category string I                                 
item map_key fda_application_number                         
item.fda_application_number string 50007                    
item map_key fda_approval_date                              
item.fda_approval_date string 1990-09-30T00:00:00.000       
item map_key fda_product_name                               
item.fda_product_name string DOXYCYCLINE HYCLATE 100MG CAP  
item map_key fda_ther_equiv_code                            
item.fda_ther_equiv_code string AB                          
item map_key labeler_code                                   
item.labeler_code string 59762                              
item map_key labeler_name                                   
item.labeler_name string PFIZER, INC.                       
item map_key market_date                                    
item.market_date string 1990-09-30T00:00:00.000             
item map_key ndc                                            
item.ndc string 59762369001                                 
item map_key package_size_code                              
item.package_size_code string 01                            
item map_key package_size_intro_date                        
item.package_size_intro_date string 2015-05-01T00:00:00.000 
item map_key pediatric_indicator                            
item.pediatric_indicator string N                           
item map_key product_code                                   
item.product_code string 3690                               
.
.
.

I would like some way to know which data belongs to which object. Something like an index, but I am not sure how to do that.

Answer 1

Perhaps a better idea is to use ijson.items:

item_gen = ijson.items(urllib.urlopen('https://data.medicaid.gov/resource/4qik-skk9.json?$limit=5'), 'item')

You can then iterate over item_gen in a for loop or with any of the itertools functions. For example:

for item in item_gen:
    print(item)

I used Python 2, but I expect it to be very similar in Python 3 (where urllib.urlopen becomes urllib.request.urlopen).

I got this idea from here.
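
As a rough sketch of the itertools suggestion, the snippet below filters a lazy stream of records and stops after the first three matches. To keep it runnable without network access, `fake_items` is a hypothetical stand-in for the generator returned by ijson.items; its field names are borrowed from the small example elsewhere on this page and are not the Medicaid data. It is written for Python 3.

```python
import itertools


def fake_items():
    """Hypothetical stand-in for ijson.items(stream, 'item').

    Yields one dict per record, lazily, the way the real generator would.
    """
    for n in range(1, 1001):
        yield {"name": f"object{n}", "color": "blue" if n % 2 else "red"}


# Keep only the red records, then stop after the first three of them.
# Because everything is lazy, the remaining records are never produced.
red_only = (rec for rec in fake_items() if rec["color"] == "red")
first_three = list(itertools.islice(red_only, 3))

print([rec["name"] for rec in first_three])
# ['object2', 'object4', 'object6']
```

With the real generator the same pattern should apply: each record arrives as a complete dict, so there is no need to track which key belongs to which object.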

Answer 2

Of course there is still a structure. That is what the various start_/end_ events are for. Given a data structure like this:

[
  {
    "name": "object1",
    "color": "blue"
  },
  {
    "name": "object2",
    "color": "red"
  }
]

Your loop will produce the following stream:

 start_array None
item start_map None
item map_key name
item.name string object1
item map_key color
item.color string blue
item end_map None
item start_map None
item map_key name
item.name string object2
item map_key color
item.color string red
item end_map None
 end_array None

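You use the start_array and end_array events to detect when you enter and leave an array, and the start_map and end_map events to detect when you enter and leave a map object. This lets you reconstruct the structure of the original data.

For example, here is a really dumb parser that rebuilds the original data from the stream: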
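
To illustrate how start_map and end_map can give each record an index, here is a minimal sketch. The function name `indexed_records` and the hard-coded `EVENTS` list are made up for this example; the events mirror the stream shown above, and the sketch assumes a flat top-level array of objects whose values are all scalars.

```python
# ijson event names that carry a scalar value.
SCALAR_EVENTS = {"null", "boolean", "integer", "double", "number", "string"}


def indexed_records(events):
    """Yield (index, record) for each object in a flat array of objects.

    `events` is any iterable of (prefix, event, value) triples, such as
    the output of ijson.parse. Nested containers are ignored here.
    """
    index = -1
    record = None
    for prefix, event, value in events:
        if event == "start_map":
            index += 1          # a new object begins
            record = {}
        elif event == "end_map":
            yield index, record  # the object is complete
            record = None
        elif record is not None and event in SCALAR_EVENTS:
            # prefix looks like "item.name"; keep the part after "item."
            record[prefix.split(".", 1)[1]] = value


# Hard-coded copy of the event stream shown above, for demonstration only.
EVENTS = [
    ("", "start_array", None),
    ("item", "start_map", None),
    ("item", "map_key", "name"),
    ("item.name", "string", "object1"),
    ("item", "map_key", "color"),
    ("item.color", "string", "blue"),
    ("item", "end_map", None),
    ("item", "start_map", None),
    ("item", "map_key", "name"),
    ("item.name", "string", "object2"),
    ("item", "map_key", "color"),
    ("item.color", "string", "red"),
    ("item", "end_map", None),
    ("", "end_array", None),
]

for i, rec in indexed_records(EVENTS):
    print(i, rec)
```

Because each record is yielded as soon as its end_map arrives, only one object needs to be held in memory at a time.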

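# `parser` is the ijson.parse(...) generator from the question.
top = None      # the outer list
cur = None      # the dict currently being filled
cur_k = None    # the most recent key seen
for prefix, event, value in parser:
    if event == 'start_array':
        top = []
    elif event == 'start_map':
        # A new object begins: create it and attach it to the list.
        cur = {}
        top.append(cur)
    elif event == 'map_key':
        cur_k = value
    elif event == 'string':
        # Only string values are handled; other types are dropped.
        cur[cur_k] = value

print(top)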

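I say "really dumb" because it only works for the specific data format shown in the example; anything else will probably break it. You will likely want something more robust.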
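
One possible direction for something sturdier is to keep a stack of open containers, so that nested arrays and objects and non-string values are handled too. The following is only a sketch under that assumption; `rebuild` and `NESTED_EVENTS` are names invented for this example, and the events are written by hand in the (prefix, event, value) shape shown above.

```python
def rebuild(events):
    """Rebuild a Python value from (prefix, event, value) triples."""
    root = None
    stack = []  # containers currently open, innermost last
    keys = []   # pending key for each open container (None for lists)

    def attach(value):
        """Store a value in the innermost open container, or as the root."""
        nonlocal root
        if not stack:
            root = value
        elif isinstance(stack[-1], list):
            stack[-1].append(value)
        else:
            stack[-1][keys[-1]] = value

    for _prefix, event, value in events:
        if event in ("start_array", "start_map"):
            container = [] if event == "start_array" else {}
            attach(container)
            stack.append(container)
            keys.append(None)
        elif event in ("end_array", "end_map"):
            stack.pop()
            keys.pop()
        elif event == "map_key":
            keys[-1] = value
        else:
            # Any scalar event: string, number, boolean, null, ...
            attach(value)
    return root


# Hand-written events for [{"name": "object1", "tags": ["a", "b"]}].
NESTED_EVENTS = [
    ("", "start_array", None),
    ("item", "start_map", None),
    ("item", "map_key", "name"),
    ("item.name", "string", "object1"),
    ("item", "map_key", "tags"),
    ("item.tags", "start_array", None),
    ("item.tags.item", "string", "a"),
    ("item.tags.item", "string", "b"),
    ("item.tags", "end_array", None),
    ("item", "end_map", None),
    ("", "end_array", None),
]

print(rebuild(NESTED_EVENTS))
```

Note that this still builds the whole document in memory, just like the dumb parser above, so for very large inputs you would want to hand off each completed top-level object instead of keeping them all.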

How to parse objects in a JSON stream
