If I have a list of dictionaries like this:
my_list = [{datetime.datetime(1955, 1, 1, 0, 0): [{'coverage': 'MISSING',
'base_height': '914',
'cloud_type':'40'},
{'coverage': 'MISSING', 'base_height': '1280', 'cloud_type': '40'}]},
{datetime.datetime(1955, 1, 1, 1, 0): [{'coverage': '02',
'base_height': '600',
'cloud_type': '90'},
{'coverage': '06', 'base_height': '900', 'cloud_type':'90'}]},
{datetime.datetime(1955, 1, 1, 1, 0): [{'coverage': 'MISSING',
'base_height': '1524',
'cloud_type': '40'},
{'coverage': 'MISSING', 'base_height': '2438', 'cloud_type': '40'}]},
{datetime.datetime(1955, 1, 1, 2, 0): [{'coverage': '01',
'base_height': '600',
'cloud_type': '90'},
{'coverage': '07', 'base_height': '1050', 'cloud_type': '90'}]},
{datetime.datetime(1955, 1, 1, 2, 0): [{'coverage': 'MISSING',
'base_height': '1524',
'cloud_type': '40'},
{'coverage': 'MISSING', 'base_height': '5182', 'cloud_type': '40'}]},
{datetime.datetime(1955, 1, 1, 3, 0): [{'coverage': '01',
'base_height': '600',
'cloud_type': '90'},
{'coverage': '05', 'base_height': '1200', 'cloud_type': '90'}]},
{datetime.datetime(1955, 1, 1, 3, 0): [{'coverage': 'MISSING',
'base_height': '1524',
'cloud_type': '40'},
{'coverage': 'MISSING', 'base_height': '5182', 'cloud_type': '40'},
{'coverage': 'MISSING', 'base_height': '99999', 'cloud_type': 'MISSING'},
{'coverage': 'MISSING', 'base_height': '99999', 'cloud_type': 'MISSING'}]},
{datetime.datetime(1955, 1, 1, 4, 0): [{'coverage': '01',
'base_height': '750',
'cloud_type': '90'},
{'coverage': '05', 'base_height': '1200', 'cloud_type': '90'}]},
{datetime.datetime(1955, 1, 1, 4, 0): [{'coverage': 'MISSING',
'base_height': '1676',
'cloud_type': '40'},
{'coverage': 'MISSING', 'base_height': '5182', 'cloud_type': '40'}]}]
How can I convert it into a Series like the following?
1955-01-01 00:00:00+00:00  0  coverage       01
                              base_height    600
                              cloud_type     90
                           1  coverage       07
                              base_height    1050
                              cloud_type     90
Right now I am trying:
pd.concat([pd.DataFrame.from_dict(aa, orient='index').stack() for aa in my_list]).apply(pd.Series).stack()
but the list comprehension and .apply(pd.Series) take a very long time on my full dataset (> 65,000 list entries).
Answer 0 (score: 2)
Give this a try; on my PC it runs in about 2 ms:
import pandas as pd
from collections import defaultdict
from itertools import product, chain

# Unpack the dicts nested inside my_list.
# Would love to learn from others if there is a better way here,
# especially given the three levels of for loop.
d = defaultdict(list)
for entry in my_list:
    for key, value in entry.items():
        for ent in value:
            d[key].append(ent.values())

# Use product from itertools to pair each date with each unpacked value;
# essentially, you get one row with date, coverage, base height and cloud type.
m = chain.from_iterable(product([key], val) for key, val in d.items())

# Now we can safely go into pandas.
res = pd.DataFrame(((key, *val) for key, val in m),
                   columns=['Date', 'coverage', 'base_height', 'cloud_type'])

# The glorious stack.
fin = res.set_index('Date').stack()
fin.head()
Date
1955-01-01  coverage       MISSING
            base_height        914
            cloud_type          40
            coverage       MISSING
            base_height       1280
dtype: object
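Update: after playing with the code a bit more, I think I found a simpler approach with less code. The time is still about 2 ms, most of which goes into creating the DataFrame: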
d = []
for entry in my_list:
    for k, v in entry.items():
        for ent in v:
            # Note: update() adds 'Date' to the dicts inside my_list in place.
            ent.update({'Date': k})
            d.append(ent)
res = pd.DataFrame(d).set_index('Date').stack()
res.head()
Date
1955-01-01  coverage       MISSING
            base_height        914
            cloud_type          40
            coverage       MISSING
            base_height       1280
dtype: object
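The output above has only two index levels (Date and field name), while the Series asked for in the question also has a 0, 1, ... counter between them. A minimal sketch of one way to add that level follows. It assumes the counter should number the layers within each timestamp, which is a guess at the asker's intent; the names sample, to_layered_series and layer are made up for illustration and appear nowhere in the thread. It also builds new row dicts instead of calling update(), so the input list is left untouched.

```python
import datetime
import pandas as pd

# Small sample in the same shape as my_list from the question.
sample = [
    {datetime.datetime(1955, 1, 1, 0, 0): [
        {'coverage': 'MISSING', 'base_height': '914', 'cloud_type': '40'},
        {'coverage': 'MISSING', 'base_height': '1280', 'cloud_type': '40'}]},
    {datetime.datetime(1955, 1, 1, 1, 0): [
        {'coverage': '02', 'base_height': '600', 'cloud_type': '90'},
        {'coverage': '06', 'base_height': '900', 'cloud_type': '90'}]},
    {datetime.datetime(1955, 1, 1, 1, 0): [
        {'coverage': 'MISSING', 'base_height': '1524', 'cloud_type': '40'}]},
]


def to_layered_series(records):
    """Flatten records into a Series indexed by (Date, layer, field)."""
    rows = []
    for entry in records:
        for date, layers in entry.items():
            for layer in layers:
                # Build a new dict so the input dicts are not modified.
                rows.append({'Date': date, **layer})
    frame = pd.DataFrame(rows)
    # Assumption: number the layers 0, 1, 2, ... within each timestamp,
    # even when the same timestamp appears in several list entries.
    frame['layer'] = frame.groupby('Date').cumcount()
    return frame.set_index(['Date', 'layer']).stack()


layered = to_layered_series(sample)
print(layered)
```

Because the 01:00 timestamp appears in two entries of the sample, its layers are numbered 0, 1, 2 across both. If the counter should restart for every list entry instead, the index would contain duplicates and a different key (for example the entry position) would be needed.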