Python Pandas: creating records from a complex dictionary

Date: 2016-08-25 19:25:20

Tags: python json pandas dictionary

I have processed some very complex nested JSON objects down to the following general dictionary format:

{'key1':'value1',
 'key2':'value2',
 'key3':'value3',
 'key4':'value4',
 'key5':[['value5', 'value6', 'value7'], ['value8', 'value9', 'value10']],
 'key6':[['value5', 'value6', 'value7'], ['value8', 'value9', 'value10']]}

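Within each list of lists, every inner list represents the equivalent of an "individual transaction". Each transaction shares the key1, key2, key3, key4 pairs. There can be any number of inner lists. I am trying to efficiently convert these into records in a pandas DataFrame, like so: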

 key1_field, key2_field, key3_field, key4_field, key5_or_key6_field_1, key5_or_key6_field_2, key5_or_key6_field_3, key5_or_key6_indicator
     value1,     value2,     value3,     value4,               value5,               value6,               value7,                   key5
     value1,     value2,     value3,     value4,               value5,               value6,               value7,                   key6
     value1,     value2,     value3,     value4,               value8,               value9,              value10,                   key5
     value1,     value2,     value3,     value4,               value8,               value9,              value10,                   key6

I would sincerely appreciate any help! This has been a challenge so far. Thanks!

Edit

As suggested, here is what I have been trying so far:

import pandas as pd
import numpy as np

d = {'key1':'value1',
     'key2':'value2',
     'key3':'value3',
     'key4':'value4',
     'key5':[['value5', 'value6', 'value7'], ['value8', 'value9', 'value10']],
     'key6':[['value5', 'value6', 'value7'], ['value8', 'value9', 'value10']]}

# d.items() works on both Python 2 and 3 (iteritems() is Python 2 only)
df = pd.DataFrame({k: pd.Series(v) for k, v in d.items()})

My remaining problem is that the single-valued keys are NaN after the first row.

2 answers:

Answer 0 (score: 2)

One option is to read the dictionary in as is and then reshape the data frame:

df = pd.DataFrame({'key1':'value1',
 'key2':'value2',
 'key3':'value3',
 'key4':'value4',
 'key5':[['value5', 'value6', 'value7'], ['value8', 'value9', 'value10']],
 'key6':[['value5', 'value6', 'value7'], ['value8', 'value9', 'value10']]})

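# Index by the shared keys, stack key5/key6 into rows,
# then expand each inner list into its own columns.
df.set_index(['key1', 'key2', 'key3', 'key4']).stack().apply(pd.Series) \
  .rename(columns = lambda x: "value_" + str(x)).reset_index()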

#     key1    key2    key3    key4  level_4 value_0 value_1 value_2
# 0 value1  value2  value3  value4  key5    value5  value6  value7
# 1 value1  value2  value3  value4  key6    value5  value6  value7
# 2 value1  value2  value3  value4  key5    value8  value9  value10
# 3 value1  value2  value3  value4  key6    value8  value9  value10
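
For comparison, here is a minimal sketch that builds the records with a plain loop instead of reshaping. It assumes Python 3.7+ (so dict insertion order fixes the column order) and that you know which keys hold scalars and which hold lists of lists. The names `shared_keys`, `list_keys`, `field_N` and `indicator` are illustrative and do not come from the question.

```python
import pandas as pd

d = {'key1': 'value1',
     'key2': 'value2',
     'key3': 'value3',
     'key4': 'value4',
     'key5': [['value5', 'value6', 'value7'], ['value8', 'value9', 'value10']],
     'key6': [['value5', 'value6', 'value7'], ['value8', 'value9', 'value10']]}

# Assumption: these two groups of keys are known up front.
shared_keys = ['key1', 'key2', 'key3', 'key4']
list_keys = ['key5', 'key6']

shared = {k: d[k] for k in shared_keys}

records = []
for indicator in list_keys:
    for transaction in d[indicator]:
        record = dict(shared)  # copy the shared scalar values
        for i, value in enumerate(transaction, start=1):
            record['field_%d' % i] = value
        record['indicator'] = indicator  # remember which key the row came from
        records.append(record)

records_df = pd.DataFrame(records)
```

The rows come out grouped by source key (all `key5` rows, then all `key6` rows) rather than interleaved as in the question, so sort afterwards if the order matters.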

Answer 1 (score: 1)

Try this:

# ffill() copies the first-row values of key1..key4 down into the NaN rows
pd.DataFrame({k: pd.Series(v) for k, v in d.items()}).ffill()
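
Note that forward-filling only repairs the NaN values. The `key5` and `key6` columns still hold whole lists, so a further step is needed to reach the layout in the question. One possible way to finish is sketched below, assuming pandas 0.20 or later (for `DataFrame.melt`) and inner lists of exactly three items. The column names `indicator`, `fields_list` and `field_N` are made up for illustration.

```python
import pandas as pd

d = {'key1': 'value1',
     'key2': 'value2',
     'key3': 'value3',
     'key4': 'value4',
     'key5': [['value5', 'value6', 'value7'], ['value8', 'value9', 'value10']],
     'key6': [['value5', 'value6', 'value7'], ['value8', 'value9', 'value10']]}

# Step 1: build the frame and forward-fill the shared scalar columns.
filled = pd.DataFrame({k: pd.Series(v) for k, v in d.items()}).ffill()

# Step 2: move key5/key6 from columns into rows, keeping the source key name.
long_df = filled.melt(id_vars=['key1', 'key2', 'key3', 'key4'],
                      var_name='indicator', value_name='fields_list')

# Step 3: expand each inner list into separate columns
# (assumes every inner list has three items).
fields = pd.DataFrame(long_df['fields_list'].tolist(),
                      columns=['field_1', 'field_2', 'field_3'])

result = pd.concat([long_df.drop(columns='fields_list'), fields], axis=1)
```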