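Flattening nested JSON with multiple column types (some nested)

Time: 2017-10-26 22:48:09

Tags: python pandas dataframe

I'm pulling output from an API that looks like this (formatted as best I could):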

{
    "other":{
                Not important.. (ignored later)
            },
    "resultList":[
        {
            "date": "2017-10-26T21:52:59.840Z",
            "uniqueId": "c0a9c665-0f6f-c8",
            "children":[
                {
                    "identifier": "FAMR@316069707@3160697070",
                    "score": 1,
                    "parentId": "c0a9c665-0f6f-4fc8"
                },
                {
                    Same format as first child...
                },
                {
                    Same format as first child...
                }
            ],
            "weights":[
                60,
                20,
                20
            ],
            "type": "ABC"
        },
        {
            Same format as first dictionary…
        }
    ]
}

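Based on a search of Stack Overflow, I handled this by fetching the JSON, normalizing only resultList (the only part I care about), orienting the result by columns, and converting it to a pandas DataFrame. Here is the code: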

import getpass

import pandas as pd
import requests

# Get JSON from API
user = input("Enter User Name: ")
password = getpass.getpass("Enter Password: ")
url = 'https://API_url'
req = requests.post(url=url, auth=(user, password))
out = req.json()

# Create normalized dataframe from API
# (pd.json_normalize is the public entry point since pandas 1.0 and already returns a DataFrame)
solr_df = pd.json_normalize(out["resultList"])

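However, while this flattens resultList into columns, the children column is still nested as a list of dictionaries (whose strings are displayed with a u prefix, which I don't want), and the weights column is still a list.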
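To make the problem concrete, here is a minimal sketch that reproduces it offline. The `sample` dictionary is a hypothetical stand-in for the API response: the first child and the weights come from the output shown above, while the second and third children are made up, since the original elides them. It assumes pandas 1.0 or later for `pd.json_normalize`.

```python
import pandas as pd

# Hypothetical stand-in for the API response (one record, three children).
sample = {
    "resultList": [
        {
            "date": "2017-10-26T21:52:59.840Z",
            "uniqueId": "c0a9c665-0f6f-c8",
            "children": [
                {"identifier": "FAMR@316069707@3160697070", "score": 1,
                 "parentId": "c0a9c665-0f6f-4fc8"},
                # The next two children are invented for illustration.
                {"identifier": "CHILD-2", "score": 2,
                 "parentId": "c0a9c665-0f6f-4fc8"},
                {"identifier": "CHILD-3", "score": 3,
                 "parentId": "c0a9c665-0f6f-4fc8"},
            ],
            "weights": [60, 20, 20],
            "type": "ABC",
        }
    ]
}

flat = pd.json_normalize(sample["resultList"])

# Scalar fields become ordinary columns; list-valued fields are left untouched.
print(flat.columns.tolist())
print(type(flat.loc[0, "children"]))  # still a Python list (of dicts)
print(type(flat.loc[0, "weights"]))   # still a Python list (of ints)
```

The u prefix mentioned above is only how Python 2 displays unicode strings; it is not part of the data and disappears once the values are written out or the code runs on Python 3.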

Can you help me restructure this so that the result has children and weights flattened into columns?

Thanks in advance!

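1 Answer:

Answer 0: (score: 0)

I can't think of a more efficient way to do this, though I'm sure one exists.

Loop over the JSON records and flatten the data manually.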

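# r is the parsed JSON response (called `out` in the question)
frames = []
for record in r['resultList']:

    conc = []          # one DataFrame per list-valued field
    otherFields = {}   # scalar fields, broadcast to every row below

    for field in record:

        if isinstance(record[field], list):
            if len(record[field]) > 0:
                if isinstance(record[field][0], dict):
                    # list of dicts (e.g. children): one column per key
                    conc.append(pd.DataFrame(record[field]))

                else:
                    # list of scalars (e.g. weights): a single column
                    conc.append(pd.DataFrame(record[field], columns=[field]))

        else:
            otherFields[field] = record[field]

    # Place the list-derived columns side by side, aligned by position
    df = pd.concat(conc, axis=1)

    for field in otherFields:
        df[field] = otherFields[field]

    frames.append(df)

# DataFrame.append was removed in pandas 2.0; concatenate once at the end instead
dfAll = pd.concat(frames)

dfAll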


   weights                 identifier            parentId  score  \
0       60  FAMR@316069707@3160697070  c0a9c665-0f6f-4fc8      1   
1       20  FAMR@316069707@3160697070  c0a9c665-0f6f-4fc8      1   
2       20  FAMR@316069707@3160697070  c0a9c665-0f6f-4fc8      1   
0       10  FAMR@316069707@3160697070  c0a9c665-0f6f-4fc8      1   
1       20  FAMR@316069707@3160697070  c0a9c665-0f6f-4fc8      1   
2       30  FAMR@316069707@3160697070  c0a9c665-0f6f-4fc8      1   

                       date type          uniqueId  
0  2017-10-26T21:52:59.840Z  ABC  c0a9c665-0f6f-c8  
1  2017-10-26T21:52:59.840Z  ABC  c0a9c665-0f6f-c8  
2  2017-10-26T21:52:59.840Z  ABC  c0a9c665-0f6f-c8  
0  2015-10-26T21:52:59.840Z  ABC               123  
1  2015-10-26T21:52:59.840Z  ABC               123  
2  2015-10-26T21:52:59.840Z  ABC               123
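
As a possible alternative to the manual loop, `pd.json_normalize` can expand the children itself through its `record_path` and `meta` parameters. The following is only a sketch under assumptions not confirmed by the question: that `weights[i]` belongs to `children[i]` within each record, and that every record carries `date`, `uniqueId` and `type`. The helper name `flatten_results` and the sample `records` are hypothetical.

```python
import pandas as pd


def flatten_results(result_list):
    """Return one row per child, with the parent's scalar fields and a weight."""
    # Assumption: weights and children are parallel lists within each record.
    for rec in result_list:
        if len(rec["weights"]) != len(rec["children"]):
            raise ValueError("weights and children differ in length")

    # record_path expands each child dict into its own row;
    # meta repeats the listed parent fields on every child row.
    flat = pd.json_normalize(
        result_list,
        record_path="children",
        meta=["date", "uniqueId", "type"],
    )

    # Rows come out record by record, child by child, so a flat list of
    # weights built in the same order lines up with them.
    flat["weights"] = [w for rec in result_list for w in rec["weights"]]
    return flat


# Hypothetical sample records shaped like the question's resultList.
records = [
    {
        "date": "2017-10-26T21:52:59.840Z",
        "uniqueId": "c0a9c665-0f6f-c8",
        "children": [
            {"identifier": "A-1", "score": 1, "parentId": "c0a9c665-0f6f-4fc8"},
            {"identifier": "A-2", "score": 2, "parentId": "c0a9c665-0f6f-4fc8"},
        ],
        "weights": [60, 40],
        "type": "ABC",
    },
    {
        "date": "2015-10-26T21:52:59.840Z",
        "uniqueId": "123",
        "children": [{"identifier": "B-1", "score": 3, "parentId": "123"}],
        "weights": [100],
        "type": "ABC",
    },
]

flat = flatten_results(records)
print(flat)
```

Compared with the loop, this version names the fields explicitly instead of discovering them, so it fails loudly if a record is missing one of the `meta` keys or if the two lists are not the same length.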