Question

I have a dictionary of DataFrames, for example:

{
'table_1':              name             color             type
                        Banana           Yellow            Fruit,
'another_table_1':      city             state             country
                        Atlanta          Georgia           United States,
'and_another_table_1':  firstname        middlename        lastname
                        John             Patrick           Snow,
'table_2':              name             color             type
                        Red              Apple             Fruit,
'another_table_2':      city             state             country
                        Arlington        Virginia          United States,
'and_another_table_2':  firstname        middlename        lastname
                        Alex             Justin            Brown,
'table_3':              name             color             type
                        Lettuce          Green             Vegetable,
'another_table_3':      city             state             country
                        Dallas           Texas             United States,
'and_another_table_3':  firstname        middlename        lastname
                        Michael          Alex              Smith             }

I want to combine these DataFrames based on their names, so that I end up with only 3 DataFrames:

table

name        color       type
Banana     Yellow     Fruit
Red         Apple     Fruit
Lettuce     Green     Vegetable

another_table

city        state          country
Atlanta     Georgia        United States
Arlington   Virginia       United States
Dallas      Texas          United States

and_another_table

firstname        middlename        lastname
John             Patrick           Snow
Alex             Justin            Brown
Michael          Alex              Smith

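From my initial research, it seems this should be possible in Python by:

1. using .split, a dictionary comprehension, and itertools.groupby to group the DataFrames in the dictionary by key name
2. creating a dictionary of dictionaries from those grouped results
3. using pandas.concat to loop through those dictionaries and concatenate the DataFrames in each group

I don't have much experience with Python, and I'm a bit lost on how to actually write this code.

I have reviewed the posts "How to group similar items in a list?" and "Merge dataframes in a dictionary", but they didn't help because, in my case, the DataFrame names vary in length.

I also don't want to hard-code any of the DataFrame names, because there are more than 1000 of them.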

Answer 1

Here is one way to do it.

Given this dictionary of DataFrames:

import pandas as pd

dd = {'table_1': pd.DataFrame({'Name':['Banana'], 'color':['Yellow'], 'type':'Fruit'}),
      'table_2': pd.DataFrame({'Name':['Apple'], 'color':['Red'], 'type':'Fruit'}),
      'another_table_1':pd.DataFrame({'city':['Atlanta'],'state':['Georgia'], 'Country':['United States']}),
      'another_table_2':pd.DataFrame({'city':['Arlinton'],'state':['Virginia'], 'Country':['United States']}),
      'and_another_table_1':pd.DataFrame({'firstname':['John'], 'middlename':['Patrick'], 'lastnme':['Snow']}),
      'and_another_table_2':pd.DataFrame({'firstname':['Alex'], 'middlename':['Justin'], 'lastnme':['Brown']}),
     }

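# Derive each group name by stripping the trailing "_<number>" from the key
tables = {k.rsplit('_', 1)[0] for k in dd}
# Match on the exact prefix; startswith would also pull e.g. 'table_extra_1' into 'table'
dict_of_dfs = {t: pd.concat([df for k, df in dd.items() if k.rsplit('_', 1)[0] == t]) for t in tables}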

This produces a new dictionary of combined tables:

dict_of_dfs['table']

#      Name   color   type
# 0  Banana  Yellow  Fruit
# 0   Apple     Red  Fruit

dict_of_dfs['another_table']

#        city     state        Country
# 0   Atlanta   Georgia  United States
# 0  Arlinton  Virginia  United States

dict_of_dfs['and_another_table']

#   firstname middlename lastnme
# 0      John    Patrick    Snow
# 0      Alex     Justin   Brown
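
Note that each combined frame above keeps the original row labels, so every row is indexed 0. If a fresh 0..n-1 index is preferred, pd.concat accepts ignore_index=True. The following is only a minimal sketch under that assumption; the helper name combine_by_prefix and the trimmed sample data are hypothetical. It groups on the exact text before the final underscore, so a key such as 'table_extra_1' would land in its own 'table_extra' group rather than in 'table'.

```python
import pandas as pd


def combine_by_prefix(frames):
    """Concatenate DataFrames whose keys share the text before the final underscore."""
    groups = {}
    for key, df in frames.items():
        prefix = key.rsplit('_', 1)[0]  # 'another_table_2' -> 'another_table'
        groups.setdefault(prefix, []).append(df)
    # ignore_index=True renumbers the rows of each combined frame from 0
    return {prefix: pd.concat(dfs, ignore_index=True) for prefix, dfs in groups.items()}


# Trimmed sample data, for illustration only
dd = {
    'table_1': pd.DataFrame({'Name': ['Banana'], 'color': ['Yellow'], 'type': ['Fruit']}),
    'table_2': pd.DataFrame({'Name': ['Apple'], 'color': ['Red'], 'type': ['Fruit']}),
    'another_table_1': pd.DataFrame({'city': ['Atlanta'], 'state': ['Georgia']}),
    'another_table_2': pd.DataFrame({'city': ['Arlington'], 'state': ['Virginia']}),
}

combined = combine_by_prefix(dd)
print(combined['table'])
```

Because the grouping is a single pass over the dictionary, it should scale to the 1000+ keys mentioned in the question without any hard-coded names.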

Another approach uses defaultdict from collections to build a list of the combined DataFrames:

from collections import defaultdict
import pandas as pd

dd = {'table_1': pd.DataFrame({'Name':['Banana'], 'color':['Yellow'], 'type':'Fruit'}),
      'table_2': pd.DataFrame({'Name':['Apple'], 'color':['Red'], 'type':'Fruit'}),
      'another_table_1':pd.DataFrame({'city':['Atlanta'],'state':['Georgia'], 'Country':['United States']}),
      'another_table_2':pd.DataFrame({'city':['Arlinton'],'state':['Virginia'], 'Country':['United States']}),
      'and_another_table_1':pd.DataFrame({'firstname':['John'], 'middlename':['Patrick'], 'lastnme':['Snow']}),
      'and_another_table_2':pd.DataFrame({'firstname':['Alex'], 'middlename':['Justin'], 'lastnme':['Brown']}),
     }
# Group the DataFrames by the key text before the final underscore
d = defaultdict(list)
for k, df in dd.items():
    d[k.rsplit('_', 1)[0]].append(df)

# Concatenate each group; the groups follow the insertion order of dd
l_of_dfs = [pd.concat(dfs) for dfs in d.values()]
print(l_of_dfs[0])
print('\n')
print(l_of_dfs[1])
print('\n')
print(l_of_dfs[2])

Output:

     Name   color   type
0  Banana  Yellow  Fruit
0   Apple     Red  Fruit


       city     state        Country
0   Atlanta   Georgia  United States
0  Arlinton  Virginia  United States


  firstname middlename lastnme
0      John    Patrick    Snow
0      Alex     Justin   Brown
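
The question also mentions itertools.groupby. That route can work too, with one caveat: groupby only merges adjacent items, so the keys need to be sorted by the same prefix function first. A rough sketch, assuming the keys always end in "_<number>" (the prefix helper and the trimmed sample data are hypothetical):

```python
from itertools import groupby

import pandas as pd


def prefix(key):
    """Return the key text before the final underscore."""
    return key.rsplit('_', 1)[0]


# Trimmed sample data, for illustration only
dd = {
    'table_1': pd.DataFrame({'Name': ['Banana'], 'color': ['Yellow'], 'type': ['Fruit']}),
    'another_table_1': pd.DataFrame({'city': ['Atlanta'], 'state': ['Georgia']}),
    'table_2': pd.DataFrame({'Name': ['Apple'], 'color': ['Red'], 'type': ['Fruit']}),
    'another_table_2': pd.DataFrame({'city': ['Arlington'], 'state': ['Virginia']}),
}

# groupby only groups consecutive items, so sort by prefix first.
# sorted() is stable, so keys within a group keep their original order.
sorted_keys = sorted(dd, key=prefix)

grouped = {
    name: pd.concat([dd[k] for k in keys], ignore_index=True)
    for name, keys in groupby(sorted_keys, key=prefix)
}

print(grouped['table'])
print(grouped['another_table'])
```

Compared with the defaultdict version, this needs an extra sort, so it is mainly worth considering when the groups should come out in sorted order.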
