pandas - Split data from one column into multiple columns

Time: 2020-05-06 16:22:07

Tags: pandas

I have a DataFrame in the following format:

id, data
101, [{"tree":[
               {"Group":"1001","sub-group":3,"Child":"100267","Child_1":"8 cm"},
               {"Group":"1002","sub-group":1,"Child":"102280","Child_1":"4 cm"},
               {"Group":"1003","sub-group":0,"Child":"102579","Child_1":"0.1 cm"}]}]
102, [{"tree":[
               {"Group":"2001","sub-group":3,"Child":"200267","Child_1":"6 cm"},
               {"Group":"2002","sub-group":1,"Child":"202280","Child_1":"4 cm"}]}]
103,  

I am trying to split the data in this one column into multiple columns.

Expected output:

id, Group, sub-group, Child, Child_1, Group, sub-group, Child, Child_1, Group, sub-group, Child, Child_1
101, 1001, 3, 100267, 8 cm, 1002, 1, 102280, 4 cm, 1003, 0, 102579, 0.1 cm
102, 2001, 3, 200267, 6 cm, 2002, 1, 202280, 4 cm
103

Output of df.loc[:15, ['id','data']].to_dict():
{'id': {1: '101',
        4: '102',
        11: '103',
        15: '104',
        16: '105'},
        'data': {1: '[{"tree":[{"Group":"","sub-group":"3","Child":"100267","Child_1":"8 cm"}]}]',
        4: '[{"tree":[{"sub-group":"0.01","Child_1":"4 cm"}]}]',
        11: '[{"tree":[{"sub-group":null,"Child_1":null}]}]',
        15: '[{"tree":[{"Group":"1003","sub-group":15,"Child":"child_","Child_1":"41 cm"}]}]',
        16: '[{"tree":[{"sub-group":"0.00","Child_1":"0"}]}]'}}

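1 Answer:

Answer 0: (score: 2)

You can use explode on the data column, build a DataFrame from the result, add a cumulative counter column, then use set_index, stack, unstack and droplevel to match your expected output, and finally join the result back to the id column.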

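import pandas as pd

# Assumes each non-null cell is a dict with a 'tree' key; if the cell is a
# one-element list wrapping that dict, .str[0].str['tree'] may be needed instead.
s = df['data'].dropna().str['tree'].explode()
df_f = df[['id']].join(pd.DataFrame(s.tolist(), s.index)
                         # number the records within each original row
                         .assign(cc=lambda x: x.groupby(level=0).cumcount()+1)
                         .set_index('cc', append=True)
                         .stack()
                         # move (cc, field name) into the columns
                         .unstack(level=[-2,-1])
                         # drop the cc level so only field names remain
                         .droplevel(0, axis=1),
                       how='left')
print(df_f)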
    id Group sub-group   Child Child_1 Group sub-group   Child Child_1 Group  \
0  101  1001         3  100267    8 cm  1002         1  102280    4 cm  1003   
1  102  2001         3  200267    6 cm  2002         1  202280    4 cm   NaN   
2  103   NaN       NaN     NaN     NaN   NaN       NaN     NaN     NaN   NaN   

  sub-group   Child Child_1  
0         0  102579  0.1 cm  
1       NaN     NaN     NaN  
2       NaN     NaN     NaN  
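
To make the first steps of the chain easier to follow, here is a minimal sketch of the intermediate results before the reshaping. The miniature frame is hypothetical, and it assumes each cell is already a parsed dict with a 'tree' key rather than a string; the name records is only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the question's frame; each cell is assumed to be
# an already-parsed dict with a 'tree' key (None marks the empty row).
df = pd.DataFrame({
    "id": ["101", "102", "103"],
    "data": [
        {"tree": [{"Group": "1001", "Child": "100267"},
                  {"Group": "1002", "Child": "102280"}]},
        {"tree": [{"Group": "2001", "Child": "200267"}]},
        None,
    ],
})

# Step 1: one row per record, with the original row label repeated in the index.
s = df["data"].dropna().str["tree"].explode()

# Step 2: expand each record dict into columns, keeping the repeated index.
records = pd.DataFrame(s.tolist(), index=s.index)

# Step 3: number the records within each original row (1, 2, ...).
records["cc"] = records.groupby(level=0).cumcount() + 1

print(records)
```

The repeated index is what later lets the result be joined back to id, and cc is what tells the first record of a row apart from the second.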

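Note: although this does match your expected output, reusing the same column name several times is not good practice. I would rather remove the droplevel call and flatten the MultiIndex columns.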
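
One possible way to get unique, flat column names is sketched below. It is not the exact chain from above: it replaces stack/unstack/droplevel with set_index plus unstack followed by a rename. The sample frame is hypothetical, cells are assumed to be parsed dicts, and the 'r<n>_<field>' naming scheme is just an arbitrary choice made here so that the existing Child_1 field does not collide with a generated name.

```python
import pandas as pd

# Hypothetical sample; cells are assumed to be parsed dicts with a 'tree' key.
df = pd.DataFrame({
    "id": ["101", "102", "103"],
    "data": [
        {"tree": [{"Group": "1001", "Child": "100267", "Child_1": "8 cm"},
                  {"Group": "1002", "Child": "102280", "Child_1": "4 cm"}]},
        {"tree": [{"Group": "2001", "Child": "200267", "Child_1": "6 cm"}]},
        None,
    ],
})

s = df["data"].dropna().str["tree"].explode()
records = pd.DataFrame(s.tolist(), index=s.index)
fields = list(records.columns)
records["cc"] = records.groupby(level=0).cumcount() + 1

# Pivot the record number into the columns: a MultiIndex of (field, cc).
wide = records.set_index("cc", append=True).unstack("cc")

# Flatten to unique names; the 'r<n>_<field>' scheme is an arbitrary choice.
wide.columns = [f"r{cc}_{field}" for field, cc in wide.columns]

# Group the columns record by record rather than field by field.
n_records = int(records["cc"].max())
wide = wide[[f"r{cc}_{field}"
             for cc in range(1, n_records + 1)
             for field in fields]]

df_flat = df[["id"]].join(wide, how="left")
print(df_flat.columns.tolist())
```

With unique names, a single column can be selected by label without getting several columns back.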

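Edit: after some discussion in the comments, I guess this is a way to actually go through the whole column, given its somewhat odd format: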

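import ast

def f(x):
    # Parse one cell; literal_eval does not understand JSON null, so it is
    # replaced by the string 'nan' first.
    try:
        return ast.literal_eval(x.replace('null', "'nan'"))[0]['tree']
    except Exception:
        # Missing or malformed cells become a single empty record.
        return [{}]

# then create s with
s = df['data'].apply(f).explode()
# then create df_f like above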
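
As a possible alternative to f, the strings shown in the question's to_dict() output look like valid JSON, so json.loads may be enough. This is only a sketch under that assumption: it swaps ast.literal_eval for the json module, so null becomes None rather than the string 'nan'. The helper name parse_tree is hypothetical, and the sample rows are copied from the question.

```python
import json
import pandas as pd

# First three rows copied from the question's to_dict() output.
df = pd.DataFrame({
    "id": ["101", "102", "103"],
    "data": [
        '[{"tree":[{"Group":"","sub-group":"3","Child":"100267","Child_1":"8 cm"}]}]',
        '[{"tree":[{"sub-group":"0.01","Child_1":"4 cm"}]}]',
        '[{"tree":[{"sub-group":null,"Child_1":null}]}]',
    ],
}, index=[1, 4, 11])

def parse_tree(cell):
    """Return the list of records under 'tree', or [{}] if the cell is unusable."""
    try:
        return json.loads(cell)[0]["tree"]
    except (TypeError, ValueError, IndexError, KeyError):
        # Covers NaN cells, malformed JSON and unexpected shapes.
        return [{}]

s = df["data"].apply(parse_tree).explode()
records = pd.DataFrame(s.tolist(), index=s.index)
print(records)
```

From here, records can go through the same cumcount and reshaping steps as before.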