Python: How to merge two data frames whose key values are not unique

Time: 2018-08-11 17:07:47

Tags: pandas merge

I have two data frames

import pandas as pd
a = pd.DataFrame( { 'port':[1,1,0,1,0], 'cd':[1,2,3,2,1], 'date':["2014-02-26","2014-02-25","2014-02-26","2014-02-26","2014-02-25"] } )
b = pd.DataFrame( { 'port':[0,1,0,1,0], 'fac':[2,1,2,2,3], 'date': ["2014-02-25","2014-02-25","2014-02-26","2014-02-26","2014-02-27"] } )

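What I need to do is take each date-port pair, for example port 0 and date 2014-02-25, look up the matching fac value in b, and fill it into a new column in a. So the output should look like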

port cd date         fac 
1    1  "2014-02-26" 2
1    2  "2014-02-25" 1
... (and so on) ...

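I tried simply merging the frames on date and port, but I got an error, which I think is caused by the data frames being different sizes. I did not really expect it to work anyway.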

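3 Answers:

Answer 0: (score: 2)

If you want to merge two data frames, you should use merge. Called without arguments, it joins on the columns the two frames share, here port and date: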

import pandas as pd
a = pd.DataFrame( { 'port':[1,1,0,1,0], 'cd':[1,2,3,2,1], 
         'date':["2014-02-26","2014-02-25","2014-02-26","2014-02-26","2014-02-25"]})

b = pd.DataFrame( { 'port':[0,1,0,1,0], 'fac':[2,1,2,2,3], 
         'date': ["2014-02-25","2014-02-25","2014-02-26","2014-02-26","2014-02-27"]})

df = a.merge(b)
print (df)

Output:

  port  cd  date       fac
0   1   1   2014-02-26  2
1   1   2   2014-02-26  2
2   1   2   2014-02-25  1
3   0   3   2014-02-26  2
4   0   1   2014-02-25  2
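
Since the question asks to keep every row of a and only look up fac, a left merge with explicit keys may be closer to the intent. The following is a minimal sketch rather than part of the answer above: the `how='left'` and `validate='many_to_one'` arguments are additions, and the sketch assumes each (port, date) pair occurs at most once in b.

```python
import pandas as pd

a = pd.DataFrame({
    'port': [1, 1, 0, 1, 0],
    'cd': [1, 2, 3, 2, 1],
    'date': ["2014-02-26", "2014-02-25", "2014-02-26", "2014-02-26", "2014-02-25"],
})
b = pd.DataFrame({
    'port': [0, 1, 0, 1, 0],
    'fac': [2, 1, 2, 2, 3],
    'date': ["2014-02-25", "2014-02-25", "2014-02-26", "2014-02-26", "2014-02-27"],
})

# Keep every row of a (how='left') and look up fac by (port, date).
# validate='many_to_one' makes pandas raise a MergeError if a key pair
# is repeated in b, which would otherwise silently multiply rows.
left = a.merge(b, on=['port', 'date'], how='left', validate='many_to_one')
print(left)
```

With a left merge, a pair from a that has no match in b would get NaN in fac instead of being dropped.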

Answer 1: (score: 1)

I think you need drop_duplicates together with merge:

cols = ['port','date']
df = a.drop_duplicates(cols).merge(b, on=cols)
print (df)
   port  cd        date  fac
0     1   1  2014-02-26    2
1     1   2  2014-02-25    1
2     0   3  2014-02-26    2
3     0   1  2014-02-25    2

But if you want all combinations of the duplicated pairs:

cols = ['port','date']
df1 = a.merge(b, on=cols)
print (df1)
   port  cd        date  fac
0     1   1  2014-02-26    2
1     1   2  2014-02-26    2
2     1   2  2014-02-25    1
3     0   3  2014-02-26    2
4     0   1  2014-02-25    2
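
How many rows the merge returns depends on how often each (port, date) pair occurs on each side. As an illustration that is not part of the answer above, the pairs can be counted before merging, and `indicator=True` shows which frame each key came from:

```python
import pandas as pd

a = pd.DataFrame({
    'port': [1, 1, 0, 1, 0],
    'cd': [1, 2, 3, 2, 1],
    'date': ["2014-02-26", "2014-02-25", "2014-02-26", "2014-02-26", "2014-02-25"],
})
b = pd.DataFrame({
    'port': [0, 1, 0, 1, 0],
    'fac': [2, 1, 2, 2, 3],
    'date': ["2014-02-25", "2014-02-25", "2014-02-26", "2014-02-26", "2014-02-27"],
})
cols = ['port', 'date']

# Number of rows whose (port, date) pair appears more than once in each frame
dup_a = a.duplicated(cols, keep=False).sum()
dup_b = b.duplicated(cols, keep=False).sum()
print(dup_a, dup_b)

# An outer merge with indicator=True adds a '_merge' column that marks
# each row as 'both', 'left_only' or 'right_only'
outer = a.merge(b, on=cols, how='outer', indicator=True)
print(outer['_merge'].value_counts())
```

In this data only a repeats a pair, (1, "2014-02-26"), so the inner merge returns one row per row of a. The pair (0, "2014-02-27") exists only in b and shows up as right_only in the outer merge.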

Answer 2: (score: 1)

I suggest creating a new column in data frame A and filling it with "numpy.vectorize"

import pandas as pd
import numpy as np

A = pd.DataFrame({'port': [1, 1, 0, 1, 0], 'cd': [1, 2, 3, 2, 1], 'date': ["2014-02-26", "2014-02-25", "2014-02-26", "2014-02-26", "2014-02-25"]})
B = pd.DataFrame({'port': [0, 1, 0, 1, 0], 'fac': [2, 1, 2, 2, 3], 'date': ["2014-02-25", "2014-02-25", "2014-02-26", "2014-02-26", "2014-02-27"]})

Set an index on data frame B so that rows can be accessed by 'date' and 'port':

C = B.set_index(['date', 'port'])

Then create a function that will be applied to each row of data frame A:

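def get_fac(date, port):
    # Look up 'fac' for one (date, port) pair; return '' if the pair is not in B
    try:
        return C.loc[(date, port), 'fac']
    except KeyError:
        return ''

# otypes=[object] lets the result hold both integers and the '' placeholder
A['fac'] = np.vectorize(get_fac, otypes=[object])(A['date'], A['port'])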

This is the output:

   cd        date  port  fac
0   1  2014-02-26     1    2
1   2  2014-02-25     1    1
2   3  2014-02-26     0    2
3   2  2014-02-26     1    2
4   1  2014-02-25     0    2
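
Note that numpy.vectorize still calls the Python function once per row, so it may be slow on large frames. As a possible alternative that is not part of the answer above, the indexed frame can be joined directly. This sketch assumes the (date, port) pairs in B are unique; pairs missing from B would come out as NaN rather than ''.

```python
import pandas as pd

A = pd.DataFrame({
    'port': [1, 1, 0, 1, 0],
    'cd': [1, 2, 3, 2, 1],
    'date': ["2014-02-26", "2014-02-25", "2014-02-26", "2014-02-26", "2014-02-25"],
})
B = pd.DataFrame({
    'port': [0, 1, 0, 1, 0],
    'fac': [2, 1, 2, 2, 3],
    'date': ["2014-02-25", "2014-02-25", "2014-02-26", "2014-02-26", "2014-02-27"],
})

# Index B by the lookup keys, as in the answer above
C = B.set_index(['date', 'port'])

# join() matches A's 'date' and 'port' columns against the index levels of C.
# It is a left join by default, so every row of A is kept.
result = A.join(C, on=['date', 'port'])
print(result)
```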