Unpacking nested dictionaries from a data parser

Time: 2020-05-04 16:05:39

Tags: python pandas

Suppose I have a list of dictionaries like this:

my_list = [{datetime.datetime(1955, 1, 1, 0, 0): [{'coverage': 'MISSING',
    'base_height': '914',
    'cloud_type':'40'},
   {'coverage': 'MISSING', 'base_height': '1280', 'cloud_type': '40'}]},
 {datetime.datetime(1955, 1, 1, 1, 0): [{'coverage': '02',
    'base_height': '600',
    'cloud_type': '90'},
   {'coverage': '06', 'base_height': '900', 'cloud_type':'90'}]},
 {datetime.datetime(1955, 1, 1, 1, 0): [{'coverage': 'MISSING',
    'base_height': '1524',
    'cloud_type': '40'},
   {'coverage': 'MISSING', 'base_height': '2438', 'cloud_type': '40'}]},
 {datetime.datetime(1955, 1, 1, 2, 0): [{'coverage': '01',
    'base_height': '600',
    'cloud_type': '90'},
   {'coverage': '07', 'base_height': '1050', 'cloud_type': '90'}]},
 {datetime.datetime(1955, 1, 1, 2, 0): [{'coverage': 'MISSING',
    'base_height': '1524',
    'cloud_type': '40'},
   {'coverage': 'MISSING', 'base_height': '5182', 'cloud_type': '40'}]},
 {datetime.datetime(1955, 1, 1, 3, 0): [{'coverage': '01',
    'base_height': '600',
    'cloud_type': '90'},
   {'coverage': '05', 'base_height': '1200', 'cloud_type': '90'}]},
 {datetime.datetime(1955, 1, 1, 3, 0): [{'coverage': 'MISSING',
    'base_height': '1524',
    'cloud_type': '40'},
   {'coverage': 'MISSING', 'base_height': '5182', 'cloud_type': '40'},
   {'coverage': 'MISSING', 'base_height': '99999', 'cloud_type': 'MISSING'},
   {'coverage': 'MISSING', 'base_height': '99999', 'cloud_type': 'MISSING'}]},
 {datetime.datetime(1955, 1, 1, 4, 0): [{'coverage': '01',
    'base_height': '750',
    'cloud_type': '90'},
   {'coverage': '05', 'base_height': '1200', 'cloud_type': '90'}]},
 {datetime.datetime(1955, 1, 1, 4, 0): [{'coverage': 'MISSING',
    'base_height': '1676',
    'cloud_type': '40'},
   {'coverage': 'MISSING', 'base_height': '5182', 'cloud_type': '40'}]}]

How can I convert it into a Series like the following:

1955-01-01 00:00:00+00:00  0  coverage         01
                              base_height     600
                              cloud_type       90
                           1  coverage         07
                              base_height    1050
                              cloud_type       90

Right now I am trying:

pd.concat([pd.DataFrame.from_dict(aa, orient='index').stack() for aa in my_list]).apply(pd.Series).stack()

However, the list comprehension and .apply(pd.Series) take a very long time to process my full dataset (> 65,000 list entries).

1 answer:

Answer 0 (score: 2)

Try this; on my PC it runs in about 2 ms:

from collections import defaultdict
from itertools import product, chain

import pandas as pd

# Unpack the inner dicts, grouping their values by date.
# I would love to learn from others if there is a better way here,
# especially given the three levels of for loop.
d = defaultdict(list)
for entry in my_list:
    for key, value in entry.items():
        for ent in value:
            # relies on every inner dict listing its keys in the same order
            d[key].append(ent.values())

# Use itertools.product to pair each date with each of its unpacked values;
# essentially, you get one row with date, coverage, base height and cloud type.
m = chain.from_iterable(product([key], val) for key, val in d.items())

# Now we can safely go into pandas.
res = pd.DataFrame(((key, *val) for key, val in m),
                   columns=['Date', 'coverage', 'base_height', 'cloud_type'])

# The glorious stack: one value per (Date, field) pair.
fin = res.set_index('Date').stack()

fin.head() 

Date
1955-01-01  coverage       MISSING
            base_height        914
            cloud_type          40
            coverage       MISSING
            base_height       1280
dtype: object
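
One caveat with the code above: `ent.values()` only lines up with the `columns` list if every inner dict stores its keys in the same order. A minimal sketch that selects the fields by name instead, assuming all three keys are always present (`fields`, `pick`, `rows` and the small `sample` list are illustrative names, not part of the original data):

```python
import datetime
from operator import itemgetter

import pandas as pd

fields = ['coverage', 'base_height', 'cloud_type']
pick = itemgetter(*fields)  # returns a tuple of the three values, looked up by key

# Illustrative sample; the second inner dict deliberately uses a different key order.
sample = [
    {datetime.datetime(1955, 1, 1, 0, 0): [
        {'coverage': 'MISSING', 'base_height': '914', 'cloud_type': '40'},
        {'cloud_type': '40', 'base_height': '1280', 'coverage': 'MISSING'}]},
]

# One tuple per inner dict: (date, coverage, base_height, cloud_type).
rows = [
    (date, *pick(ent))
    for entry in sample
    for date, ents in entry.items()
    for ent in ents
]

res = pd.DataFrame(rows, columns=['Date', *fields])
fin = res.set_index('Date').stack()
print(fin)
```

Unlike the `defaultdict` version, this keeps the rows in their original list order rather than grouping them by date.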

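Update: after playing with the code a bit more, I think I found a cleaner approach with less code. The time is still ~2 ms, most of which is spent creating the DataFrame: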

d = []
for entry in my_list:
    for k, v in entry.items():
        for ent in v:
            # build a new dict rather than ent.update(...), so my_list is not mutated
            d.append({'Date': k, **ent})

res = pd.DataFrame(d).set_index('Date').stack()

res.head()

Date                   
1955-01-01  coverage       MISSING
            base_height        914
            cloud_type          40
            coverage       MISSING
            base_height       1280
dtype: object
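
The output above has no position level, whereas the Series in the question has a 0/1 counter between the timestamp and the field name. Here is a minimal sketch of one way to add it, assuming the counter is simply each inner dict's position within its own list. The question does not say how repeated timestamps should be numbered, so here the counter restarts for each list entry. The `layer` column name, `rows`, and the small `sample` list are illustrative:

```python
import datetime

import pandas as pd

# Small illustrative sample in the same shape as my_list.
sample = [
    {datetime.datetime(1955, 1, 1, 2, 0): [
        {'coverage': '01', 'base_height': '600', 'cloud_type': '90'},
        {'coverage': '07', 'base_height': '1050', 'cloud_type': '90'}]},
]

# One row per inner dict; enumerate supplies its position within the list.
# Building a new dict leaves the input dicts untouched.
rows = [
    {'Date': date, 'layer': pos, **ent}
    for entry in sample
    for date, ents in entry.items()
    for pos, ent in enumerate(ents)
]

# Index by (Date, layer), then stack the three fields into a third level.
res = pd.DataFrame(rows).set_index(['Date', 'layer']).stack()
print(res)
```

If the timestamps need to be timezone-aware, as the `+00:00` in the question suggests, the `Date` column could be localized before setting the index, but which zone applies is not stated in the question.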