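How to convert a DataFrame with dictionary columns into a multi-level DataFrame

Time: 2019-08-26 09:42:35

Tags: python pandas multi-level

I have a DataFrame in which some of the columns contain dictionaries.

It can be created as follows: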

 lis = [
     {'id': '1', 
     'author': {'self': 'A', 
     'displayName': 'A'}, 
     'created': '2018-12-18', 
     'items': {'field': 'status', 
         'fromString': 'Backlog'}}, 
     {'id': '2', 
     'author': {'self': 'B', 
     'displayName': 'B'}, 
     'created': '2018-12-18', 
     'items': {'field': 'status', 
         'fromString': 'Funnel'}}] 

pd.DataFrame(lis)  

                              author     created id                                           items
0  {'self': 'A', 'displayName': 'A'}  2018-12-18  1  {'field': 'status', 'fromString': 'Backlog'}
1  {'self': 'B', 'displayName': 'B'}  2018-12-18  2   {'field': 'status', 'fromString': 'Funnel'}

I want to convert this into a multi-level DataFrame.

I have been trying

pd.MultiIndex.from_product(lis) 
pd.MultiIndex.from_frame(pd.DataFrame(lis))

but I cannot get the result I want. Basically, I want something like this:

author             created     id  items
self  displayName                  field   fromString
A     A            2018-12-18  1   status  Backlog
B     B            2018-12-18  2   status  Funnel

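Any suggestions on how to achieve this?

Thanks

2 answers:

Answer 0: (score: 3)

You can use json_normalize (from pandas.io.json), but note that it flattens the nested keys into column names joined with a . separator: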

# pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize
from pandas import json_normalize

lis = [
     {'id': '1', 
     'author': {'self': 'A', 
     'displayName': 'A'}, 
     'created': '2018-12-18', 
     'items': {'field': 'status', 
         'fromString': 'Backlog'}}, 
     {'id': '2', 
     'author': {'self': 'B', 
     'displayName': 'B'}, 
     'created': '2018-12-18', 
     'items': {'field': 'status', 
         'fromString': 'Funnel'}}] 

df = json_normalize(lis)
print (df)
  id     created author.self author.displayName items.field items.fromString
0  1  2018-12-18           A                  A      status          Backlog
1  2  2018-12-18           B                  B      status           Funnel
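
If the . in the flattened names is inconvenient, json_normalize also accepts a sep argument. The following is a minimal sketch, assuming pandas 1.0 or later (where the function is available as pd.json_normalize); the variable name flat is only illustrative and does not come from the original answer:

```python
import pandas as pd

# Same sample data as in the question
lis = [
    {'id': '1',
     'author': {'self': 'A', 'displayName': 'A'},
     'created': '2018-12-18',
     'items': {'field': 'status', 'fromString': 'Backlog'}},
    {'id': '2',
     'author': {'self': 'B', 'displayName': 'B'},
     'created': '2018-12-18',
     'items': {'field': 'status', 'fromString': 'Funnel'}},
]

# Join nested keys with '_' instead of the default '.'
flat = pd.json_normalize(lis, sep='_')
print(flat.columns.tolist())
```

The rest of this answer keeps the default . separator, since the column names are split on it in the next step.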

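To get a MultiIndex in both the index and the columns, first move every column whose name contains no . (here id and created) into the index with DataFrame.set_index, then split the remaining column names with str.split: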

df = df.set_index(['id','created'])
df.columns = df.columns.str.split('.', expand=True)
print (df)
              author               items           
                self displayName   field fromString
id created                                         
1  2018-12-18      A           A  status    Backlog
2  2018-12-18      B           B  status     Funnel

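If you need a MultiIndex only in the columns, that also works (starting again from the flat df returned by json_normalize), but the second level then contains missing values for the column names that had no . in them: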

df.columns = df.columns.str.split('.', expand=True)
print (df)
   id     created author               items           
  NaN         NaN   self displayName   field fromString
0   1  2018-12-18      A           A  status    Backlog
1   2  2018-12-18      B           B  status     Funnel

The missing values can then be replaced with empty strings:

df = df.rename(columns= lambda x: '' if x != x else x)
print (df)
  id     created author               items           
                   self displayName   field fromString
0  1  2018-12-18      A           A  status    Backlog
1  2  2018-12-18      B           B  status     Funnel
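
The split and the NaN replacement can also be folded into one step by padding the short names with empty strings before the MultiIndex is built. This is a minimal sketch of that variant, assuming pandas 1.0 or later; the names flat, parts and depth are illustrative and not part of the original answer:

```python
import pandas as pd

# Same sample data as in the question
lis = [
    {'id': '1',
     'author': {'self': 'A', 'displayName': 'A'},
     'created': '2018-12-18',
     'items': {'field': 'status', 'fromString': 'Backlog'}},
    {'id': '2',
     'author': {'self': 'B', 'displayName': 'B'},
     'created': '2018-12-18',
     'items': {'field': 'status', 'fromString': 'Funnel'}},
]

flat = pd.json_normalize(lis)

# Split every flattened name on '.', e.g. 'author.self' -> ['author', 'self']
parts = [name.split('.') for name in flat.columns]

# Pad shorter names with '' so that every tuple has the same number of levels
depth = max(len(p) for p in parts)
flat.columns = pd.MultiIndex.from_tuples(
    [tuple(p + [''] * (depth - len(p))) for p in parts]
)

# A top-level key selects the whole group; a full tuple selects one column
print(flat['author'])
print(flat[('items', 'fromString')].tolist())
```

Padding with '' rather than None or NaN keeps the header blank when printed, and columns such as id remain addressable with the tuple ('id', '').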

Answer 1: (score: 1)

Try the following approach; I hope it helps.

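import pandas as pd

df = pd.json_normalize(lis)  # pandas >= 1.0; older versions: pd.io.json.json_normalize(lis)
# Reorder the columns alphabetically first, so the new labels stay attached to the right data
df = df[sorted(df.columns)]
print(list(df.columns))

# Split dotted names into (top, sub) tuples; names without a dot get None as the second level
tuple_list = [tuple(name.split(".")) if "." in name else (name, None) for name in df.columns]

df.columns = pd.MultiIndex.from_tuples(tuple_list)
print(df)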

The output will look like this:

author            created     id   items
displayName self  NaN         NaN  field   fromString
A           A     2018-12-18  1    status  Backlog
B           B     2018-12-18  2    status  Funnel