How to merge many DataFrames by index, combining values where columns overlap?

Date: 2019-10-07 18:21:11

Tags: python pandas dataframe

I have many DataFrames that I need to merge.

Let's say:

base: id  constraint
      1   'a'
      2   'b'
      3   'c'

df_1: id value constraint
      1  1     'a'
      2  2     'a'
      3  3     'a'

df_2: id value constraint
      1  1     'b'
      2  2     'b'
      3  3     'b'


df_3: id value constraint
      1  1     'c'
      2  2     'c'
      3  3     'c'

If I try to merge them all (this will happen in a loop), I get:

a = pd.merge(base, df_1, on=['id', 'constraint'], how='left')
b = pd.merge(a, df_2, on=['id', 'constraint'], how='left')
c = pd.merge(b, df_3, on=['id', 'constraint'], how='left')

id constraint value_x value_y value
1  'a'        1       NaN     NaN
2  'b'        NaN     2       NaN
3  'c'        NaN     NaN     3

The desired output would be:

id constraint value
1  'a'        1 
2  'b'        2
3  'c'        3

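I know about combine_first and it works, but I can't use that approach because it is thousands of times slower.

Is there a merge that can replace values when columns overlap?

It is somewhat similar to this question, which has no answers.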

4 answers:

Answer 0 (score: 3)

Given your MCVE:

import pandas as pd

base = pd.DataFrame([1,2,3], columns=['id'])
df1 = pd.DataFrame([[1,1]], columns=['id', 'value'])
df2 = pd.DataFrame([[2,2]], columns=['id', 'value'])
df3 = pd.DataFrame([[3,3]], columns=['id', 'value'])

I suggest concatenating your DataFrames first (using a loop if necessary):

df = pd.concat([df1, df2, df3])

Then merge:

pd.merge(base, df, on='id')

This yields:

   id  value
0   1      1
1   2      2
2   3      3

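Update

Running the same concat-then-merge approach on the new version of the question, with the input provided by @Celius Stingher and merging on both 'id' and 'constrains':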

a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df1 = pd.DataFrame(b)
df2 = pd.DataFrame(c)
df3 = pd.DataFrame(d)

We get:

   id constrains  value
0   1          a      1
1   2          b      2
2   3          c      3

This appears to match your expected output.
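
For reference, here is a minimal sketch of that run, assuming the frames are built in a loop. The loop over labels and the names `frames`, `stacked` and `result` are illustrative and not from the original post. Every candidate row is stacked once with pd.concat, and a single left merge on both key columns then picks the matching row for each entry in base:

```python
import pandas as pd

base = pd.DataFrame({'id': [1, 2, 3], 'constrains': ['a', 'b', 'c']})

# Stand-in for however the individual frames are produced in practice.
frames = []
for label in ['a', 'b', 'c']:
    frames.append(pd.DataFrame({'id': [1, 2, 3],
                                'value': [1, 2, 3],
                                'constrains': [label] * 3}))

# Stack all candidate rows once, then do a single merge on both keys,
# instead of one merge per frame.
stacked = pd.concat(frames, ignore_index=True)
result = pd.merge(base, stacked, on=['id', 'constrains'], how='left')
print(result)
```

Because only one merge happens, no suffixed value_x / value_y columns are created in the first place.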

Answer 1 (score: 3)

You can use ffill() for this:

df_1 = pd.DataFrame({'val':[1]}, index=[1])
df_2 = pd.DataFrame({'val':[2]}, index=[2])
df_3 = pd.DataFrame({'val':[3]}, index=[3])

(pd.concat((df_1,df_2,df_3), axis=1)
   .ffill(axis=1)
   .iloc[:,-1]
)

Output:

1    1.0
2    2.0
3    3.0
Name: val, dtype: float64
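
The same idea could also be applied to a frame that already contains the suffixed columns from a chain of merges. The following is only a sketch: the `wide` frame is hand-built to mimic that situation, and `value_cols`, `coalesced` and `result` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the result of merging in a loop:
# each row has exactly one non-null entry among the value* columns.
wide = pd.DataFrame({
    'id': [1, 2, 3],
    'value_x': [1, np.nan, np.nan],
    'value_y': [np.nan, 2, np.nan],
    'value': [np.nan, np.nan, 3],
})

value_cols = [c for c in wide.columns if c.startswith('value')]

# Forward-fill from left to right so that the last column ends up holding
# the last non-null value of each row, then keep only that column.
coalesced = wide[value_cols].ffill(axis=1).iloc[:, -1]
result = wide[['id']].assign(value=coalesced)
print(result)
```

If a row had several non-null entries, this would keep the rightmost one.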

For your new data:

base.merge(pd.concat((df1,df2,df3)),
           on=['id','constraint'],
           how='left')

Output:

   id constraint  value
0   1        'a'      1
1   2        'b'      2
2   3        'c'      3

Bottom line: what you are actually looking for is the how='left' option of merge.

Answer 2 (score: 1)

If you only need to merge all the DataFrames with base:

Based on the edit:

import pandas as pd
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df_1 = pd.DataFrame(b)
df_2 = pd.DataFrame(c)
df_3 = pd.DataFrame(d)

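dataframes = [df_1,df_2,df_3]
# Left-merge each frame onto base; overlapping 'value' columns get suffixes
for i in dataframes:
    base = base.merge(i,how='left',on=['id','constrains'])
# Collapse every 'value*' column into one (sum skips NaN by default)
summation = [col for col in base if col.startswith('value')]
base['value'] = base[summation].sum(axis=1)
# Drop the leftover helper columns, which still contain NaN
base = base.dropna(how='any',axis=1)
print(base)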

Output:

   id constrains  value
0   1          a    1.0
1   2          b    2.0
2   3          c    3.0
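
One caveat with summing, shown as a sketch on hand-built data (the `merged` frame and its fourth, unmatched row are assumptions for illustration): sum(axis=1) skips NaN, so a row that matched none of the frames comes out as 0.0 rather than NaN. Passing min_count=1 keeps such rows as NaN.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the merged frame; id 4 matched nothing.
merged = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'value_x': [1, np.nan, np.nan, np.nan],
    'value_y': [np.nan, 2, np.nan, np.nan],
    'value': [np.nan, np.nan, 3, np.nan],
})
cols = ['value_x', 'value_y', 'value']

# Default behaviour: an all-NaN row sums to 0.0.
plain = merged[cols].sum(axis=1)

# With min_count=1, a row needs at least one non-null value,
# otherwise the result stays NaN.
guarded = merged[cols].sum(axis=1, min_count=1)

print(plain.tolist())
print(guarded.tolist())
```

Note that with min_count=1 the combined column may itself contain NaN, so the helper columns would need to be dropped by name instead of with dropna(how='any', axis=1).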

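Answer 3 (score: 0)

For those who just want a merge that overwrites values (which was my case), this can be achieved with the following function, which is very similar to Celius Stingher's answer.

A documented version is available in the original gist.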

import pandas as pa

def rmerge(left,right,**kwargs):
    # Function to flatten lists from http://rosettacode.org/wiki/Flatten_a_list#Python
    def flatten(lst):
        return sum( ([x] if not isinstance(x, list) else flatten(x) for x in lst), [] )

    # Set default for removing overlapping columns in "left" to be true
    myargs = {'replace':'left'}
    myargs.update(kwargs)

    # Remove the replace key from the argument dict to be sent to
    # pandas merge command
    kwargs = {k:v for k,v in myargs.items() if k != 'replace'}

    if myargs['replace'] is not None:
        # Generate a list of overlapping column names not associated with the join
        skipcols = set(flatten([v for k, v in myargs.items() if k in ['on','left_on','right_on']]))
        leftcols = set(left.columns)
        rightcols = set(right.columns)
        dropcols = list((leftcols & rightcols).difference(skipcols))

        # Remove the overlapping column names from the appropriate DataFrame
        if myargs['replace'].lower() == 'left':
            left = left.copy().drop(dropcols,axis=1)
        elif myargs['replace'].lower() == 'right':
            right = right.copy().drop(dropcols,axis=1)

    df = pa.merge(left,right,**kwargs)

    return df
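
The core idea can be condensed into a few lines. This is a simplified sketch rather than the gist's code: `merge_overwrite` is a hypothetical helper that only supports the `on` argument and always lets the right frame win.

```python
import pandas as pd

def merge_overwrite(left, right, on, how='left'):
    """Hypothetical helper: merge so that overlapping non-key columns
    come from `right` instead of producing _x / _y suffixes."""
    keys = [on] if isinstance(on, str) else list(on)
    # Columns present in both frames that are not join keys.
    overlap = [c for c in left.columns if c in right.columns and c not in keys]
    # Drop them from the left frame so only the right-hand copy survives.
    return pd.merge(left.drop(columns=overlap), right, on=keys, how=how)

left = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
right = pd.DataFrame({'id': [1, 2, 3], 'value': [1, 2, 3]})

result = merge_overwrite(left, right, on='id')
print(result)
```

Because the overlapping column is dropped from the left frame before merging, a left row with no match in the right frame ends up with NaN instead of keeping its old value.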