pandas: join two DataFrames based on relationships described in a dictionary

Time: 2018-07-26 20:37:22

Tags: python pandas

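I want to join two DataFrames based on relationships described in a dictionary of lists. Each key in the dictionary refers to an id in the idA column of dfA, and each item in the corresponding list is an id in the idB column of dfB. The DataFrames and the dictionary look like this: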

dfA
     colA    colB   idA
0    a       abc    3
1    b       def    4
2    b       ghi    5 

dfB
     colX    idB   colZ  
0    bob     7     a
1    bob     7     b
2    bob     7     c
3    jim     8     d
4    jake    9     a 
5    jake    9     e 

myDict = { '3': [ '7', '8' ], '4': [], '5': ['7', '9'] }

How can I use myDict to join the two DataFrames and produce a DataFrame like the one below?

dfC
     colA    colB   idA   colX    idB   colZ 
0    a       abc    3     bob     7      a
1                                        b
2                                        c
3                         jim     8      d
4    b       def    4     None    None  None
5    b       ghi    5     bob     7      a
6                                        b
7                                        c
8                         jake    9      a
9                                        e

2 answers:

Answer 0: (score: 1)

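You can build a link table (a DataFrame) from the dictionary. A complete working example follows. You may need to sort the rows and reorder the columns at the end to reproduce your output exactly.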

import pandas as pd
import numpy as np

dfA = pd.DataFrame({'colA': ('a', 'b', 'b'),
                    'colB': ('abc', 'def', 'ghi'),
                    'idA': ('3', '4', '5')})

dfB = pd.DataFrame({'colX': ('bob', 'bob', 'bob', 'jim', 'jake', 'jake'),
                    'idB': ('7', '7', '7', '8', '9', '9'),
                    'colZ': ('a', 'b', 'c', 'd', 'a', 'e')})

myDict = {'3': ['7', '8'], '4': [], '5': ['7', '9']}

dfC = pd.DataFrame(columns=['idA', 'idB'])
i = 0
for key, value in myDict.items():
    # the if statement is for empty list to create one record with NaNs
    if not value:
        dfC.loc[i, 'idA'] = key
        dfC.loc[i, 'idB'] = np.nan
        i += 1
    for val in value:
        dfC.loc[i, 'idA'] = key
        dfC.loc[i, 'idB'] = val
        i += 1

temp = dfA.merge(dfC, how='right')
result = temp.merge(dfB, how='outer')

print(result)

The output is:

  colA colB idA  idB  colX colZ
0    a  abc   3    7   bob    a
1    a  abc   3    7   bob    b
2    a  abc   3    7   bob    c
3    b  ghi   5    7   bob    a
4    b  ghi   5    7   bob    b
5    b  ghi   5    7   bob    c
6    a  abc   3    8   jim    d
7    b  def   4  NaN   NaN  NaN
8    b  ghi   5    9  jake    a
9    b  ghi   5    9  jake    e
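
The sorting step mentioned above can be avoided. The following is a minimal sketch, not the answerer's code: it builds the same link table with a list comprehension instead of row-by-row `.loc` assignment, then uses two left merges, which preserve the key order of the left frame. The names `pairs`, `links` and `ordered` are made up for this illustration.

```python
import pandas as pd

dfA = pd.DataFrame({'colA': ['a', 'b', 'b'],
                    'colB': ['abc', 'def', 'ghi'],
                    'idA': ['3', '4', '5']})
dfB = pd.DataFrame({'colX': ['bob', 'bob', 'bob', 'jim', 'jake', 'jake'],
                    'idB': ['7', '7', '7', '8', '9', '9'],
                    'colZ': ['a', 'b', 'c', 'd', 'a', 'e']})
myDict = {'3': ['7', '8'], '4': [], '5': ['7', '9']}

# One (idA, idB) pair per relationship; an empty list yields a single
# (key, None) pair so that the key still appears in the result.
pairs = [(key, val)
         for key, values in myDict.items()
         for val in (values or [None])]
links = pd.DataFrame(pairs, columns=['idA', 'idB'])

# Left merges keep the row order of the left frame, so rows come out
# grouped by dfA row, then by the order of ids in each list.
ordered = dfA.merge(links, on='idA', how='left').merge(dfB, on='idB', how='left')

# Reorder the columns to match the layout asked for in the question.
ordered = ordered[['colA', 'colB', 'idA', 'colX', 'idB', 'colZ']]
print(ordered)
```

Starting from dfA rather than from the link table is a design choice: it means every dfA row survives even if its id is missing from myDict.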

Answer 1: (score: 0)

This isn't the best solution, but it is fairly simple and gets the job done:

# Assumption: the snippet uses an 'idAaux' column that it never defines;
# presumably it holds each row's list from myDict, and dfB.idB is numeric.
dfA['idAaux'] = dfA['idA'].astype(str).map(myDict)
temp = pd.DataFrame(dfA.idAaux.tolist(), index=dfA.idA).stack()
temp = temp.reset_index()[['idA', 0]]
temp.columns = ['idA', 'idB']
temp2 = dfA.merge(temp, on='idA', how='left').drop('idAaux', axis=1)
temp2['idB'] = pd.to_numeric(temp2['idB'])
res = temp2.merge(dfB, on='idB', how='left')

Output:

  colA colB  idA  idB  colX colZ
0    a  abc    3  7.0   bob    a
1    a  abc    3  7.0   bob    b
2    a  abc    3  7.0   bob    c
3    a  abc    3  8.0   jim    d
4    b  def    4  NaN   NaN  NaN
5    b  ghi    5  7.0   bob    a
6    b  ghi    5  7.0   bob    b
7    b  ghi    5  7.0   bob    c
8    b  ghi    5  9.0  jake    a
9    b  ghi    5  9.0  jake    e
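
A more compact variant of the same idea is sketched below. It assumes pandas 0.25 or later, where `DataFrame.explode` is available, and it assumes the ids are stored as strings (as in Answer 0) so that no numeric conversion is needed. `Series.map` looks up each idA in myDict, and `explode` turns each list into one row per element, with an empty list becoming a single NaN row. The name `links` is made up for this illustration.

```python
import pandas as pd

dfA = pd.DataFrame({'colA': ['a', 'b', 'b'],
                    'colB': ['abc', 'def', 'ghi'],
                    'idA': ['3', '4', '5']})
dfB = pd.DataFrame({'colX': ['bob', 'bob', 'bob', 'jim', 'jake', 'jake'],
                    'idB': ['7', '7', '7', '8', '9', '9'],
                    'colZ': ['a', 'b', 'c', 'd', 'a', 'e']})
myDict = {'3': ['7', '8'], '4': [], '5': ['7', '9']}

# Attach each row's list of related ids, then expand to one row per id.
# An empty list becomes a single row with NaN in idB.
links = dfA.assign(idB=dfA['idA'].map(myDict)).explode('idB')

# A left merge keeps the expanded rows in order and leaves NaN in the
# dfB columns where there is no related id.
res = links.merge(dfB, on='idB', how='left')
print(res)
```

Because the ids stay strings, idB is not turned into floats such as 7.0, which in the output above is a side effect of `pd.to_numeric` on a column containing NaN.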