How to further parse nested dictionaries from a JSON file into a DataFrame in Python

Asked: 2021-01-16 12:53:54

Tags: python json dataframe dictionary flatten

I have a very large JSON file that I want to convert into a DataFrame with a specific structure, described later in this question.

A few records from the sample JSON look like this:

JsonRecords = {
         'rec1': 
              {
                'words':[  ['A', 'B', 'C', '.'],  
                           ['D', 'E', 'F','.']],                           
                  'Ids':[  [0, 1],  
                           [2, 3]],

               'unique':[1, 1, 1, 0, 0, 1],

                'ments': {
                          "(0, 1)":{
                                    "A1": [0], 
                                    "A2": [0,1], 
                                    "A3": [1], 
                                    "A4": [1,0], 
                                    "A5": [0] 
                                   },                          
                         "(2, 3)": {
                                    "A1": [0], 
                                    "A2": [0], 
                                    "A3": [1],  
                                    "A5": [0] 
                                   }                  
                          }
              },  
      'rec2': 
             {
               'words':[   ['We', 'us', 'them', '.'], 
                           ['is', 'it', 'us', '.']], 
                 'Ids':[   [4, 5],  
                           [6, 7]],
              'unique':[0, 0, 0, 1, 1, 0],
                                
               "ments": {
                         "(4, 5)": {
                                    "A1": [0], 
                                    "A2": [0], 
                                    "A3": [0], 
                                    "A4": [0] 
                                   },                          
                        "(6, 7)": {
                                    "A1": [0], 
                                    "A2": [0],  
                                    "A4": [0,0], 
                                    "A6": [0,1]
                                  }
                     }
             }, 
      'rec3':             
     ..... more records
}
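
Since the real data lives in a file rather than in a Python literal, here is a minimal sketch of loading it with the standard json module. The file name records.json and the tiny sample written to it are assumptions made only so the sketch runs end to end; they are not the real file.

```python
import json
import os
import tempfile

# Hypothetical file, written here only so the sketch is runnable;
# in practice the path would point at the real JSON file.
sample = {"rec1": {"Ids": [[0, 1]], "ments": {"(0, 1)": {"A1": [0], "A2": [0, 1]}}}}
path = os.path.join(tempfile.mkdtemp(), "records.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(sample, fh)

# json.load returns plain dicts and lists, the same shape as JsonRecords above.
with open(path, encoding="utf-8") as fh:
    loaded = json.load(fh)

# The 'ments' keys come back as strings such as "(0, 1)", not as tuples.
print(list(loaded["rec1"]["ments"]))
```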

I parsed the sample JSON with the following code:

  import pandas as pd
  #import json

  all_data = []
  for k, v in JsonRecords.items():
     words, Ids, unique, ments = v['words'], v['Ids'], v['unique'], v['ments']
     for t, val, m in zip(words, Ids, ments.items()):
       all_data.append({
        'records': k,
        'words': ' '.join(t),
        'Ids': val,
        'unique': unique,
        'ments': m            
        })
  #print(all_data)
  df = pd.DataFrame(all_data)
  df.to_csv('myData.csv', encoding='utf-8')
  print(df.head())

When I run the code, I get the following DataFrame structure:

 records     words          Ids         unique                    ments                    
  rec1      A, B, C.       [0, 1]   [1, 1, 1, 0, 0, 1]   ('(0, 1)', {'A1': [0], 'A2': [0, 1], 'A3': [1], 'A4': [1, 0], 'A5': [0]})                          
  rec1      D, E, F.       [2, 3]   [1, 1, 1, 0, 0, 1]   ('(2, 3)', {'A1': [0], 'A2': [0], 'A3': [1], 'A5': [0]})                          
  rec2      We, us, them.  [4, 5]   [0, 0, 0, 1, 1, 0]   ('(4, 5)', {'A1': [0], 'A2': [0], 'A3': [0], 'A4': [0]})                            
  rec2      is, it, us.    [6, 7]   [0, 0, 0, 1, 1, 0]   ('(6, 7)', {'A1': [0], 'A2': [0], 'A4': [0, 0], 'A6': [0, 1]})                        
  rec3  

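As shown above, I could not parse the 'ments' dictionary any further in line with the 'Ids' and 'words' columns, whose values should be repeated for each row that comes from expanding the 'ments' dictionary and its nested values.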

The DataFrame structure I want for this nested JSON is shown below.

Records       words          Ids     unique                 ments    A1  A2  A3  A4  A5  A6
  rec1      A, B, C.       [0, 1]   [1, 1, 1, 0, 0, 1]     [0, 1]     0   0   1   1   0 
  rec1      A, B, C.       [0, 1]   [1, 1, 1, 0, 0, 1]     [0, 1]         1       0      
  rec1      D, E, F.       [2, 3]   [1, 1, 1, 0, 0, 1]     [2, 3]     0   0   1       0  
  rec1      D, E, F.       [2, 3]   [1, 1, 1, 0, 0, 1]     [2, 3]                       
  rec2      We, us, them.  [4, 5]   [0, 0, 0, 1, 1, 0]     [4, 5]     0   0   0   0     
  rec2      We, us, them.  [4, 5]   [0, 0, 0, 1, 1, 0]     [4, 5]                       
  rec2      is, it, us.    [6, 7]   [0, 0, 0, 1, 1, 0]     [6, 7]     0   0       0       0
  rec2      is, it, us.    [6, 7]   [0, 0, 0, 1, 1, 0]     [6, 7]                 0       1
  rec3 
  ....... more records

I would appreciate some help.

1 Answer:

Answer 0 (score: 0)

Use apply and json_normalize.
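
As a small illustration of what json_normalize contributes here, the sketch below applies it to two of the nested dictionaries from the question (the first 'ments' entry of rec1 and the second of rec2). The variable names are illustrative and not part of the original answer.

```python
import pandas as pd

# Two 'ments' values copied from the question's sample data.
ments_values = [
    {"A1": [0], "A2": [0, 1], "A3": [1], "A4": [1, 0], "A5": [0]},
    {"A1": [0], "A2": [0], "A4": [0, 0], "A6": [0, 1]},
]

# One column per key; a key that is missing from an entry becomes NaN.
normalized = pd.json_normalize(ments_values)
print(normalized)
```

Each key becomes a column, list values stay as lists, and keys missing from an entry show up as NaN, which is why the code that follows still has to expand the lists row by row.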

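import pandas as pd

def getMents(value):
    # First element of the (key, dict) tuple, e.g. '(0, 1)'
    return value[0]
def getJson(value):
    # Second element of the tuple: the nested dict {'A1': [...], ...}
    return value[1]
# all_data is the list of dicts built by the code in the question
df = pd.DataFrame(all_data)
df['json'] = df['ments'].apply(getJson)
# One column per key (A1, A2, ...); keys missing from an entry become NaN
jsonData = pd.json_normalize(df['json'])
df['ments'] = df['ments'].apply(getMents)
for col in jsonData.columns.values:
    df[col] = jsonData[col]
# Empty frames with the same columns; .copy() avoids writing into a slice of df
new_df = df[0:0].copy()
results = df[0:0].copy()
for index, row in df.iterrows():
    # Number of output rows for this entry = length of its longest list
    maxCount = 0
    for col in jsonData.columns.values:
        if isinstance(row[col], list):
            maxCount = max(maxCount, len(row[col]))
    for i in range(0, maxCount):
        # Append a copy of the original row, then keep only the i-th list element
        count = len(new_df)
        new_df.loc[count] = row

        for col in jsonData.columns.values:
            if isinstance(new_df[col][i], list):
                try:
                    new_df.loc[i, col] = new_df[col][i][i]
                except IndexError:
                    # This list is shorter than the longest one: leave the cell empty
                    new_df.loc[i, col] = None
    results = pd.concat([results, new_df])
    new_df = df[0:0].copy()

results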
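
An alternative, offered only as a sketch and not part of the original answer, is to build the expanded rows directly from the nested dictionary instead of post-processing the DataFrame. The helper name flatten_records is hypothetical, the sample is a copy of rec1 from the question, and ast.literal_eval is used on the assumption that every 'ments' key is a string such as "(0, 1)". Like the loop above, it emits as many rows per entry as the longest list in that entry.

```python
import ast
import pandas as pd

# Copy of rec1 from the question, used only for illustration.
JsonRecords = {
    'rec1': {
        'words': [['A', 'B', 'C', '.'], ['D', 'E', 'F', '.']],
        'Ids': [[0, 1], [2, 3]],
        'unique': [1, 1, 1, 0, 0, 1],
        'ments': {
            "(0, 1)": {"A1": [0], "A2": [0, 1], "A3": [1], "A4": [1, 0], "A5": [0]},
            "(2, 3)": {"A1": [0], "A2": [0], "A3": [1], "A5": [0]},
        },
    },
}

def flatten_records(records):
    """Hypothetical helper: one row per list position in each 'ments' entry."""
    rows = []
    for rec_name, rec in records.items():
        entries = zip(rec['words'], rec['Ids'], rec['ments'].items())
        for words, ids, (key, attrs) in entries:
            # The longest list decides how many rows this entry produces.
            depth = max((len(v) for v in attrs.values()), default=0)
            for i in range(depth):
                row = {
                    'records': rec_name,
                    'words': ' '.join(words),
                    'Ids': ids,
                    'unique': rec['unique'],
                    # "(0, 1)" -> [0, 1]; assumes the key is a tuple literal
                    'ments': list(ast.literal_eval(key)),
                }
                for name, values in attrs.items():
                    # Positions beyond the end of a shorter list stay empty.
                    row[name] = values[i] if i < len(values) else None
                rows.append(row)
    return pd.DataFrame(rows)

flat = flatten_records(JsonRecords)
print(flat)
```

With the rec1 sample this gives three rows: two for "(0, 1)" and one for "(2, 3)". Positions that a shorter list does not reach, and keys that an entry does not have, are left empty.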