Reshaping a pandas DataFrame for export to a nested dictionary

Time: 2017-10-22 08:27:30

Tags: python pandas dictionary dataframe

Given the following DataFrame:

   Category Area               Country Code Function Last Name     LanID  Spend1  Spend2  Spend3  Spend4  Spend5
0      Bisc   EE                  RU02,UA02       Mk     Smith    df3432     1.0     NaN     NaN     NaN     NaN
1      Bisc   EE                       RU02       Mk      Bibs    fdss34     1.0     NaN     NaN     NaN     NaN
2      Bisc   EE               UA02,EURASIA       Mk      Crow   fdsdr43     1.0     NaN     NaN     NaN     NaN
3      Bisc   WE                       FR31       Mk     Ellis   fdssdf3     1.0     NaN     NaN     NaN     NaN
4      Bisc   WE                  BE32,NL31       Mk     Mower   TOZ1720     1.0     NaN     NaN     NaN     NaN
5      Bisc   WE             FR31,BE32,NL31      LKU      Elan   SKY8851     1.0     1.0     1.0     1.0     1.0
6      Bisc   SE                       IT31       Mk    Bobret    3dfsfg     1.0     NaN     NaN     NaN     NaN
7      Bisc   SE                       GR31       Mk   Concept  MOSGX009     1.0     NaN     NaN     NaN     NaN
8      Bisc   SE   RU02,IT31,GR31,PT31,ES31      LKU     Solar   MSS5723     1.0     1.0     1.0     1.0     1.0
9      Bisc   SE        IT31,GR31,PT31,ES31       Mk      Brix    fdgd22     NaN     1.0     NaN     NaN     NaN
10     Choc   CE   RU02,CZ31,SK31,PL31,LT31      Fin    Ocoser    43233d     NaN     1.0     NaN     NaN     NaN
11     Choc   CE        DE31,AT31,HU31,CH31      Fin     Smuth     4rewf     NaN     1.0     NaN     NaN     NaN
12     Choc   CE              BG31,RO31,EMA      Fin    Momocs    hgghg2     NaN     1.0     NaN     NaN     NaN
13     Choc   WE             FR31,BE32,NL31      Fin   Bruntly    ffdd32     NaN     NaN     NaN     NaN     1.0
14     Choc   WE             FR31,BE32,NL31       Mk      Ofer  BROGX011     NaN     1.0     1.0     NaN     NaN
15     Choc   WE             FR31,BE32,NL31       Mk       Hem   NZJ3189     NaN     NaN     NaN     1.0     1.0
16      G&C   NE                  UA02,SE31       Mk       Cre   ORY9499     1.0     NaN     NaN     NaN     NaN
17      G&C   NE                       NO31       Mk      Qlyo   XVM7639     1.0     NaN     NaN     NaN     NaN
18      G&C   NE   GB31,NO31,SE31,IE31,FI31       Mk      Omny   LOX1512     NaN     1.0     1.0     NaN     NaN

I would like to export it to a nested dict with the following structure:

    {RU02:  {Bisc:  {EE:    {Mkt:   {Spend1:    {df3432:    Smith}
                                                {fdss34:     Bibs}
            {Bisc:  {SE:    {LKU:   {Spend1:    {MSS5723:   Solar}
                                    {Spend2:    {MSS5723:   Solar}
                                    {Spend3:    {MSS5723:   Solar}
                                    {Spend4:    {MSS5723:   Solar}
                                    {Spend5:    {MSS5723:   Solar}
            {Choc:  {CE:    {Fin:   {Spend2:    {43233d:   Ocoser}
            .....

    {UA02:  {Bisc:  {EE:    {Mkt:   {Spend1:    {df3432:    Smith}
                                                {fdsdr43:    Crow}
            {G&C:   {NE:    {Mkt:   {Spend1:    {ORY9499:     Cre}
    .....

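Basically, in this dict I am trying to track, for each Country Code, the list of Last Names + LanIDs under each spend category (Spend1, Spend2, etc.), together with their attributes (Function, Category, Area).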
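To make the target concrete, here is a small fragment of that structure written as a real Python literal. It is only a sketch built by hand from the first few rows of the table (using the Function value `Mk` as it appears there), not output from any code; the variable names `target` and `people` are illustrative.

```python
# Hand-written fragment of the desired nested dict (rows 0-2 of the table only).
target = {
    "RU02": {
        "Bisc": {
            "EE": {"Mk": {"Spend1": {"df3432": "Smith", "fdss34": "Bibs"}}},
        },
    },
    "UA02": {
        "Bisc": {
            "EE": {"Mk": {"Spend1": {"df3432": "Smith", "fdsdr43": "Crow"}}},
        },
    },
}

# Nesting order: Country Code -> Category -> Area -> Function -> Spend -> {LanID: Last Name}
people = target["RU02"]["Bisc"]["EE"]["Mk"]["Spend1"]
print(people)
```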

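The DataFrame is not very large (fewer than 200 rows), but it contains almost every kind of combination between Category/Area/Country Code and the Last Names with their spend categories (many-to-many).

My challenge is that I cannot clearly conceptualize the steps I need to take to prepare the DataFrame properly for export to a dict.

What I have figured out so far is that I need: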

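  1. A way to split the contents of the "Country Code" column on the "," separator: DONE
  2. Create new columns based on the unique country codes, and put 1 in each row where that country code is present: DONE
  3. Recursively set the index of the DataFrame to each of the newly added columns
  4. For each country code, move the rows that have data for it into a new DataFrame
  5. Export all the new DataFrames to dicts, then merge them
  6. I am not sure whether steps 3-6 are the best way to go about this, as I am still struggling to understand how pd.DataFrame.to_dict should be configured for my case (if that is even possible)...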

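I would greatly appreciate your help with the coding, along with a brief explanation of your thought process at each stage.

Here is how far I got on my own...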

    #keeping track of initial order of columns
    initialOrder = list(df.columns.values)
    
    # split the Country Code by ","
    CCodeNoCommas= [item for items in df['Country Code'].values for item in items.split(",")]
    
    # add only the UNIQUE Country Codes -via set- as new columns in the DataFrame,
    #with NaN for row values
    df = pd.concat([df,pd.DataFrame(columns=list(set(CCodeNoCommas)))])
    
    # reordering columns to have the newly added ones at the end
    reordered = initialOrder + [c for c in df.columns if c not in initialOrder]
    df = df[reordered]
    
    
    # replace NaN with 1 in the newly added columns (Country Codes), where the same Country code
    # exists in the initial column "Country Code"; do this for each row
    
    CCodeUniqueOnly = set(CCodeNoCommas)
    for c in CCodeUniqueOnly:   
        CCodeIsPresent_rowIndex = df.index[df['Country Code'].str.contains(c)]
    
        #print (CCodeIsPresent_rowIndex)
        df.loc[CCodeIsPresent_rowIndex, c] = 1
    
    # no clue what to do next ??
    

1 answer:

Answer 0 (score: 1)

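If you reshape the DataFrame into the right format, you can use the handy recursive dictionary function from @DSM's answer to this question. The goal is to get a DataFrame in which each row holds exactly one "entry", that is, one unique combination of the columns you are interested in.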

First, you need to split the Country Code strings into lists:

df['Country Code'] = df['Country Code'].str.split(',')
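As a quick illustration of what that split produces, here is a minimal sketch on a three-value sample taken from the table. The names `codes` and `split_codes` are made up for the example, and it assumes the codes contain no stray whitespace around the commas.

```python
import pandas as pd

# Three sample values from the "Country Code" column.
codes = pd.Series(["RU02,UA02", "RU02", "UA02,EURASIA"], name="Country Code")

# Each comma-separated string becomes a Python list of codes.
split_codes = codes.str.split(",")
print(split_codes.tolist())
# [['RU02', 'UA02'], ['RU02'], ['UA02', 'EURASIA']]
```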

Then expand those lists into multiple rows (using @RomanPekar's technique from this question):

s = df.apply(lambda x: pd.Series(x['Country Code']),axis=1) \
    .stack().reset_index(level=1, drop=True)
s.name = 'Country Code'
df = df.drop('Country Code', axis=1).join(s).reset_index(drop=True)
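On pandas 0.25 or later, `DataFrame.explode` should give the same one-row-per-code result in fewer steps. This is an alternative to the technique above rather than what the answer originally used; `df_demo` and `exploded` are illustrative names on a two-row sample.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Last Name": ["Smith", "Bibs"],
    "Country Code": ["RU02,UA02", "RU02"],
})

exploded = (
    # Turn each comma-separated string into a list of codes...
    df_demo.assign(**{"Country Code": df_demo["Country Code"].str.split(",")})
    # ...then emit one row per list element, repeating the other columns.
    .explode("Country Code")
    .reset_index(drop=True)
)
print(exploded)
```

Unlike the drop/join approach, this keeps "Country Code" in its original column position, which does not matter here because the columns are selected by name later.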

Then you can reshape the Spend* columns into rows, so that there is one row for each Spend* column whose value is not NaN:

spend_cols = ['Spend1', 'Spend2', 'Spend3', 'Spend4', 'Spend5']
df = df.groupby('Country Code') \
    .apply(lambda g: g.join(pd.DataFrame(g[spend_cols].stack()) \
    .reset_index(level=1)['level_1'])) \
    .reset_index(drop=True)
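The same reshaping can also be sketched with `melt`, which avoids the `groupby(...).apply(...)` step (its handling of the grouping column may differ between pandas versions). This is a swapped-in alternative on a tiny made-up sample, not the answer's original code; the new column is named `level_1` only to match the column name used in the rest of the answer, and `flag` is a throwaway name.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "Country Code": ["RU02", "RU02"],
    "LanID": ["df3432", "MSS5723"],
    "Last Name": ["Smith", "Solar"],
    "Spend1": [1.0, 1.0],
    "Spend2": [np.nan, 1.0],
})
spend_cols = ["Spend1", "Spend2"]
id_cols = [c for c in df_demo.columns if c not in spend_cols]

long_df = (
    # One row per (original row, Spend column) pair.
    df_demo.melt(id_vars=id_cols, value_vars=spend_cols,
                 var_name="level_1", value_name="flag")
    # Keep only the pairs where the Spend value was actually set.
    .dropna(subset=["flag"])
    .drop(columns="flag")
    .reset_index(drop=True)
)
print(long_df)
```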

Now you have a DataFrame in which each level of the nested dictionary is its own column, so you can use this recursive dictionary function:

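def recur_dictify(frame):
    # Base case: one column left, so return its value(s) as the leaf.
    if len(frame.columns) == 1:
        if frame.values.size == 1: return frame.values[0][0]
        return frame.values.squeeze()
    # Group on the first column and recurse on the remaining columns.
    grouped = frame.groupby(frame.columns[0])
    # .ix was removed in pandas 1.0; .iloc is the positional equivalent.
    d = {k: recur_dictify(g.iloc[:,1:]) for k,g in grouped}
    return d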
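To see how the recursion peels off one column per level, here is a self-contained sketch that runs the function on a three-row toy frame. The function body uses `.iloc` so that it runs on current pandas; `toy` and `nested` are illustrative names, and the rows are a reduced version of the table above.

```python
import pandas as pd

def recur_dictify(frame):
    # Base case: a single remaining column holds the leaf value(s).
    if len(frame.columns) == 1:
        if frame.values.size == 1:
            return frame.values[0][0]
        return frame.values.squeeze()
    # Group on the first column, then recurse on everything to its right.
    grouped = frame.groupby(frame.columns[0])
    return {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}

toy = pd.DataFrame({
    "Country Code": ["RU02", "RU02", "UA02"],
    "level_1": ["Spend1", "Spend1", "Spend1"],
    "LanID": ["df3432", "fdss34", "df3432"],
    "Last Name": ["Smith", "Bibs", "Smith"],
})

nested = recur_dictify(toy)
print(nested)
```

Note that if the same key path occurs more than once, the leaf becomes a NumPy array of values rather than a single string.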

And apply it only to the columns you want to turn into the nested dictionary, listed in the order in which they should be nested:

cols = ['Country Code', 'Category', 'Area', 'Function', 'level_1', 'LanID', 'Last Name']
d = recur_dictify(df[cols])

This should produce the result you want.
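Once the nested dictionary exists, collecting everyone for a given country code and spend category is a matter of walking the intermediate levels. The helper below is a hypothetical sketch (`people_for` is not part of pandas or of the answer), and `d_demo` is a hand-written fragment shaped like the expected result, not actual output.

```python
def people_for(nested, country, spend):
    """Collect {LanID: Last Name} for one country code and one Spend column."""
    found = {}
    # Levels below the country code: Category -> Area -> Function -> Spend.
    for category, areas in nested.get(country, {}).items():
        for area, functions in areas.items():
            for function, spends in functions.items():
                found.update(spends.get(spend, {}))
    return found

# Hand-written fragment in the shape produced by recur_dictify.
d_demo = {
    "RU02": {
        "Bisc": {
            "EE": {"Mk": {"Spend1": {"df3432": "Smith", "fdss34": "Bibs"}}},
            "SE": {"LKU": {"Spend1": {"MSS5723": "Solar"},
                           "Spend2": {"MSS5723": "Solar"}}},
        },
    },
}

print(people_for(d_demo, "RU02", "Spend1"))
# {'df3432': 'Smith', 'fdss34': 'Bibs', 'MSS5723': 'Solar'}
```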

All together:

df['Country Code'] = df['Country Code'].str.split(',')
s = df.apply(lambda x: pd.Series(x['Country Code']),axis=1) \
    .stack().reset_index(level=1, drop=True)
s.name = 'Country Code'
df = df.drop('Country Code', axis=1).join(s).reset_index(drop=True)

spend_cols = ['Spend1', 'Spend2', 'Spend3', 'Spend4', 'Spend5']
df = df.groupby('Country Code') \
    .apply(lambda g: g.join(pd.DataFrame(g[spend_cols].stack()) \
    .reset_index(level=1)['level_1'])) \
    .reset_index(drop=True)

def recur_dictify(frame):
    if len(frame.columns) == 1:
        if frame.values.size == 1: return frame.values[0][0]
        return frame.values.squeeze()
    grouped = frame.groupby(frame.columns[0])
    d = {k: recur_dictify(g.iloc[:,1:]) for k,g in grouped}
    return d

cols = ['Country Code', 'Category', 'Area', 'Function', 'level_1', 'LanID', 'Last Name']
d = recur_dictify(df[cols])