Building a pandas DataFrame from a nested structure in Python

Time: 2018-07-13 14:52:23

Tags: python json pandas nested structure

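I want to do machine learning on a dataset that is a bit too complicated for me. I would like to work with pandas and then use some of the built-in models in scikit-learn.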

The data comes as a JSON file; a sample looks like this:

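(The JSON sample is missing from this copy. Judging by the answers, it has a top-level "demo_Profile" object and an "event" object containing "symptoms", "info_personal" and "labs".)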

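I would like to create a pandas DataFrame that takes this "nested data" into account, but I don't know how to build a DataFrame that handles "nested parameters" in addition to "single parameters".

For example, I don't know how to merge "demo_Profile", which contains "single parameters", with the symptoms, which are a list of dictionaries whose entries hold a single value in some cases and a list in others.

Does anyone know a way to solve this?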

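EDIT*********

The JSON shown above is only an example. In other cases the number of values in the lists and the number of symptoms will differ, so the structure shown is not fixed for every record.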

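2 answers:

Answer 0: (score: 2)

Consider pandas' json_normalize. However, because some of the nesting goes deeper, consider normalizing each part of the data separately, concatenating the results, and then forward-filling the "normalized" columns.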

import json
import pandas as pd

# pd.json_normalize is the public entry point since pandas 1.0
# (the old pandas.io.json import path is deprecated)
with open('myfile.json', 'r') as f:
    data = json.load(f)

# Normalize each sub-structure on its own, then place the results side by side
final_df = pd.concat([pd.json_normalize(data['demo_Profile']),
                      pd.json_normalize(data['event']['symptoms']),
                      pd.json_normalize(data['event']['info_personal']),
                      pd.json_normalize(data['event']['labs'])], axis=1)

# FLATTEN NESTED LISTS
n_list = ['someinfo1', 'someinfo2', 'someinfo3', 'socrates.associations']

# Keep only the first element of each list cell; non-list cells (e.g. NaN) pass through
final_df[n_list] = final_df[n_list].apply(lambda col:
                     col.apply(lambda x: x[0] if isinstance(x, list) else x))

# FILLING FORWARD
norm_list = ['age', 'bmi', 'height', 'weight', 'sex', 'someinfo1', 'someinfo2', 'someinfo3', 
             'info1', 'info2', 'info3', 'info4', 'name', 'value']

final_df[norm_list] = final_df[norm_list].ffill()  

Output

print(final_df)

#     age  bmi  height   sex        someinfo1       someinfo2        someinfo3  weight   name socrates.associations socrates.onsetType socrates.timeCourse   info1   info2  info3  info4    name     value
# 0  98.0  5.0   160.0  male  some_more_info1  some_more_inf2  some_more_info3   139.0  name1         associations1         onsetType1         timeCourse1  219.59  129.18  41.15  94.19  name1   valuelab
# 1  98.0  5.0   160.0  male  some_more_info1  some_more_inf2  some_more_info3   139.0  name2                   NaN                NaN         timeCourse2  219.59  129.18  41.15  94.19  name1   valuelab
# 2  98.0  5.0   160.0  male  some_more_info1  some_more_inf2  some_more_info3   139.0  name3                   NaN         onsetType2                 NaN  219.59  129.18  41.15  94.19  name1   valuelab
# 3  98.0  5.0   160.0  male  some_more_info1  some_more_inf2  some_more_info3   139.0  name4                   NaN         onsetType3                 NaN  219.59  129.18  41.15  94.19  name1   valuelab
# 4  98.0  5.0   160.0  male  some_more_info1  some_more_inf2  some_more_info3   139.0  name5         associations2                NaN                 NaN  219.59  129.18  41.15  94.19  name1   valuelab
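As a variation on the idea above, the patient-level fields can be repeated across symptom rows with a cross join instead of concatenation plus forward fill. This is only a sketch: the `patient` record is reconstructed from the column names and values in the output table (it is not the original JSON), `symptom_rows` and the `symptom.` prefix are names made up for illustration, and `how="cross"` assumes pandas 1.2 or later.

```python
import pandas as pd

# Hypothetical record: field names and values are reconstructed from the
# output table above, not copied from the original JSON file.
patient = {
    "demo_Profile": {
        "age": 98, "bmi": 5, "height": 160, "weight": 139, "sex": "male",
        "someinfo1": ["some_more_info1"],
    },
    "event": {
        "info_personal": {"info1": 219.59, "info2": 129.18},
        "symptoms": [
            {"name": "name1",
             "socrates": {"associations": ["associations1"],
                          "onsetType": "onsetType1",
                          "timeCourse": "timeCourse1"}},
            {"name": "name2", "socrates": {"timeCourse": "timeCourse2"}},
        ],
    },
}


def symptom_rows(record):
    """Return one row per symptom, with the patient-level fields repeated."""
    # Nested dicts become dotted column names, e.g. "socrates.onsetType";
    # the prefix keeps symptom columns apart from patient-level ones.
    symptoms = pd.json_normalize(record["event"]["symptoms"]).add_prefix("symptom.")
    # A single dict normalizes to a one-row frame.
    profile = pd.json_normalize(record["demo_Profile"])
    personal = pd.json_normalize(record["event"]["info_personal"])
    # A cross join pairs the single patient-level row with every symptom row.
    patient_level = profile.merge(personal, how="cross")
    return patient_level.merge(symptoms, how="cross")


rows = symptom_rows(patient)
print(rows[["age", "sex", "symptom.name", "symptom.socrates.timeCourse"]])
```

Because every symptom column carries a prefix, there are no duplicate column names such as the two `name` columns in the output above, and list-valued cells are left untouched for a later step to handle.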

Answer 1: (score: 1)

A quick and easy way to flatten JSON data is the flatten_json package, which can be installed with pip:

pip install flatten_json

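I expect you have a list of many entries that look like the one you provided. In that case the following code will give you the desired result: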

import pandas as pd
from flatten_json import flatten

json_data = [{...patient1...}, {patient2...}, ...]

flattened = (flatten(entry) for entry in json_data)
df = pd.DataFrame(flattened)

In the flattened data, list entries get a numeric suffix (I added a second patient with an additional entry in the "labs" list):

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| index   demo_Profile_age  demo_Profile_bmi  demo_Profile_height demo_Profile_sex demo_Profile_someinfo1_0 demo_Profile_someinfo2_0 demo_Profile_someinfo3_0  demo_Profile_weight  event_info_personal_info1  event_info_personal_info2  event_info_personal_info3  event_info_personal_info4 event_labs_0_name event_labs_0_value event_labs_1_name event_labs_1_value event_symptoms_0_name event_symptoms_0_socrates_associations_0 event_symptoms_0_socrates_onsetType event_symptoms_0_socrates_timeCourse event_symptoms_1_name event_symptoms_1_socrates_timeCourse event_symptoms_2_name event_symptoms_2_socrates_onsetType event_symptoms_3_name event_symptoms_3_socrates_onsetType event_symptoms_4_name event_symptoms_4_socrates_associations_0 |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 0                98                 5                  160             male          some_more_info1           some_more_inf2          some_more_info3                  139                     219.59                     129.18                      41.15                      94.19            name1            valuelab               NaN                NaN                 name1                            associations1                          onsetType1                          timeCourse1                 name2                          timeCourse2                 name3                          onsetType2                 name4                          onsetType3                 name5                            associations2      |
| 1                98                 5                  160             male          some_more_info1           some_more_inf2          some_more_info3                  139                     219.59                     129.18                      41.15                      94.19            name1            valuelab            name2          valuelabr2                 name1                            associations1                          onsetType1                          timeCourse1                 name2                          timeCourse2                 name3                          onsetType2                 name4                          onsetType3                 name5                            associations2      |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
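To show where column names like `event_labs_1_name` come from, here is a small hand-written flattener that follows the same naming scheme. It is a stand-in written for illustration, not the flatten_json implementation; `flatten_record` and the two shortened patient records are hypothetical.

```python
import pandas as pd


def flatten_record(value, prefix="", sep="_"):
    """Flatten nested dicts and lists into a single-level dict.

    Dict keys and list positions are joined with `sep`, so
    {"labs": [{"name": "x"}]} becomes {"labs_0_name": "x"}.
    Empty dicts and lists produce no keys at all.
    """
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        return {prefix: value}  # leaf value: emit it under the accumulated key
    flat = {}
    for key, child in items:
        child_prefix = f"{prefix}{sep}{key}" if prefix else str(key)
        flat.update(flatten_record(child, child_prefix, sep))
    return flat


# Hypothetical, shortened records; the second one has an extra lab entry.
patients = [
    {"demo_Profile": {"age": 98},
     "event": {"labs": [{"name": "name1", "value": "valuelab"}]}},
    {"demo_Profile": {"age": 98},
     "event": {"labs": [{"name": "name1", "value": "valuelab"},
                        {"name": "name2", "value": "valuelabr2"}]}},
]

df = pd.DataFrame([flatten_record(p) for p in patients])
print(df.columns.tolist())
```

Records with shorter lists simply lack the higher-numbered keys, which is why the first patient shows NaN in the `event_labs_1_*` columns of the table above.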

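The flatten method takes additional parameters for dropping unwanted columns or prefixes.

Note: while this method gives you the flattened DataFrame you asked for, I expect you will run into other problems when feeding the dataset into a machine learning algorithm, depending on what your prediction target is and how you want to encode the data as features.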
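One possible encoding, sketched under the assumption that the presence of a symptom is the feature of interest, is to turn the variable-length symptom list into one indicator column per symptom name. The `long` table and the `patient_id` column are hypothetical; whether this encoding is appropriate depends on the prediction target.

```python
import pandas as pd

# Hypothetical long-format table: one row per (patient, symptom) pair.
long = pd.DataFrame({
    "patient_id": [0, 0, 0, 1, 1],
    "symptom": ["name1", "name2", "name3", "name1", "name5"],
})

# Count occurrences per patient, then clip to 0/1 indicator values.
features = pd.crosstab(long["patient_id"], long["symptom"]).clip(upper=1)
print(features)
```

Every patient then has the same number of columns regardless of how many symptoms were recorded, which is the shape scikit-learn estimators expect.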