I have many dataframes that need to be merged.
Let's say:
base:
   id constraint
    1        'a'
    2        'b'
    3        'c'
df_1:
   id  value constraint
    1      1        'a'
    2      2        'a'
    3      3        'a'
df_2:
   id  value constraint
    1      1        'b'
    2      2        'b'
    3      3        'b'
df_3:
   id  value constraint
    1      1        'c'
    2      2        'c'
    3      3        'c'
If I try to merge them all (this will happen in a loop), I get:
a = pd.merge(base, df_1, on=['id', 'constraint'], how='left')
b = pd.merge(a, df_2, on=['id', 'constraint'], how='left')
c = pd.merge(b, df_3, on=['id', 'constraint'], how='left')
   id constraint  value  value_x  value_y
    1        'a'      1      NaN      NaN
    2        'b'    NaN        2      NaN
    3        'c'    NaN      NaN        3
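The chained merges above can be reproduced with a small self-contained sketch. The `make_df` helper is hypothetical and only exists to build the three sample frames; the exact names and order of the suffixed columns follow pandas' suffix rules, so the sketch prints them rather than assuming them.

```python
import pandas as pd

base = pd.DataFrame({"id": [1, 2, 3], "constraint": ["a", "b", "c"]})

def make_df(constraint):
    # Hypothetical helper: same ids and values, one constant constraint
    return pd.DataFrame(
        {"id": [1, 2, 3], "value": [1, 2, 3], "constraint": [constraint] * 3}
    )

frames = [make_df(c) for c in ["a", "b", "c"]]

# Chained left merges: each new 'value' column collides with the
# previous one, so pandas adds suffixes instead of overwriting.
merged = base
for frame in frames:
    merged = pd.merge(merged, frame, on=["id", "constraint"], how="left")

print(merged.columns.tolist())
print(merged)
```

Each row ends up with exactly one non-null entry spread across the three value columns, which is the shape the question is trying to avoid.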
The desired output would be:
   id constraint  value
    1        'a'      1
    2        'b'      2
    3        'c'      3
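I know about combine_first and it works, but I can't take that approach because it is thousands of times slower.
Is there a merge that can replace values when columns overlap?
It is somewhat similar to this question, which has no answers.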
Answer 0: (score: 3)
Given your MCVE:
import pandas as pd
base = pd.DataFrame([1,2,3], columns=['id'])
df1 = pd.DataFrame([[1,1]], columns=['id', 'value'])
df2 = pd.DataFrame([[2,2]], columns=['id', 'value'])
df3 = pd.DataFrame([[3,3]], columns=['id', 'value'])
I suggest concatenating your dataframes first (using a loop if necessary):
df = pd.concat([df1, df2, df3])
Then merge:
pd.merge(base, df, on='id')
This yields:
id value
0 1 1
1 2 2
2 3 3
Running the code with the new version of the question and the input provided by @Celius Stingher:
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df1 = pd.DataFrame(b)
df2 = pd.DataFrame(c)
df3 = pd.DataFrame(d)
We get:
id constrains value
0 1 a 1
1 2 b 2
2 3 c 3
This seems to match your expected output.
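As a minimal sketch of the concat-then-merge idea on the two-key data, assuming the join should use both `id` and `constrains`: the variable names `parts`, `stacked`, `result` and `by_id_only` are illustrative only. Merging on `id` alone is included for contrast, since it pairs every base row with all three constraint values.

```python
import pandas as pd

base = pd.DataFrame({"id": [1, 2, 3], "constrains": ["a", "b", "c"]})
parts = [
    pd.DataFrame({"id": [1, 2, 3], "value": [1, 2, 3], "constrains": [c] * 3})
    for c in ["a", "b", "c"]
]

# Stack the frames once, then do a single merge on both key columns.
stacked = pd.concat(parts, ignore_index=True)
result = pd.merge(base, stacked, on=["id", "constrains"], how="left")

# For contrast: joining on 'id' only matches each id three times and
# produces suffixed 'constrains_x' / 'constrains_y' columns.
by_id_only = pd.merge(base, stacked, on="id")

print(result)
print(len(by_id_only))
```

A single merge against the stacked frame avoids the repeated column collisions that come from merging one frame at a time.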
Answer 1: (score: 3)
You can use ffill() for this:
df_1 = pd.DataFrame({'val':[1]}, index=[1])
df_2 = pd.DataFrame({'val':[2]}, index=[2])
df_3 = pd.DataFrame({'val':[3]}, index=[3])
(pd.concat((df_1,df_2,df_3), axis=1)
.ffill(axis=1)
.iloc[:,-1]
)
Output:
1 1.0
2 2.0
3 3.0
Name: val, dtype: float64
For your new data:
base.merge(pd.concat((df1,df2,df3)),
on=['id','constraint'],
how='left')
Output:
id constraint value
0 1 'a' 1
1 2 'b' 2
2 3 'c' 3
Conclusion: what you are actually looking for is the how='left' option of merge.
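To illustrate what `how='left'` changes, here is a small sketch with a hypothetical lookup table `values` that has no row for `(3, 'c')`. A left merge keeps every row of `base` and fills the missing value with NaN, while the default inner merge drops the unmatched row.

```python
import pandas as pd

base = pd.DataFrame({"id": [1, 2, 3], "constraint": ["a", "b", "c"]})
# Hypothetical lookup table: deliberately missing the (3, 'c') pair
values = pd.DataFrame({"id": [1, 2], "constraint": ["a", "b"], "value": [1, 2]})

# Left merge: all rows of base survive; unmatched rows get NaN.
left = base.merge(values, on=["id", "constraint"], how="left")

# Default (inner) merge: only rows with a match on both keys survive.
inner = base.merge(values, on=["id", "constraint"])

print(left)
print(inner)
```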
Answer 2: (score: 1)
If you only need to merge all the dataframes with base:
import pandas as pd
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df_1 = pd.DataFrame(b)
df_2 = pd.DataFrame(c)
df_3 = pd.DataFrame(d)
dataframes = [df_1,df_2,df_3]
for i in dataframes:
    base = base.merge(i,how='left',on=['id','constrains'])
summation = [col for col in base if col.startswith('value')]
base['value'] = base[summation].sum(axis=1)
base = base.dropna(how='any',axis=1)
print(base)
Output:
id constrains value
0 1 a 1.0
1 2 b 2.0
2 3 c 3.0
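One thing to keep in mind with the summing step: `sum(axis=1)` skips NaN, so a row that matched nothing comes out as 0.0 rather than NaN. The sketch below is a swapped-in alternative, not part of the answer above: it coalesces the value columns with `bfill(axis=1)` so unmatched rows stay NaN. The `merged` frame is hypothetical sample data standing in for the result of the chained merges.

```python
import pandas as pd

# Hypothetical result of chained left merges; id 4 matched nothing.
merged = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "value_x": [1.0, None, None, None],
    "value_y": [None, 2.0, None, None],
    "value": [None, None, 3.0, None],
})
value_cols = [col for col in merged.columns if col.startswith("value")]

# Summing treats missing values as zero, so id 4 becomes 0.0.
summed = merged[value_cols].sum(axis=1)

# Back-filling across columns pulls the first non-null value of each
# row into the first column; id 4 stays NaN.
coalesced = merged[value_cols].bfill(axis=1).iloc[:, 0]

print(summed.tolist())
print(coalesced.tolist())
```

Which behaviour is preferable depends on whether a missing match should read as zero or as missing in the final column.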
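Answer 3: (score: 0)
For those who just want to do a merge that overwrites values (which was my case), this can be achieved with the method below, which is very similar to Celius Stingher's answer.
The documented version is on the original gist.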
import pandas as pa
def rmerge(left,right,**kwargs):
    # Function to flatten lists from http://rosettacode.org/wiki/Flatten_a_list#Python
    def flatten(lst):
        return sum( ([x] if not isinstance(x, list) else flatten(x) for x in lst), [] )
    # Set default for removing overlapping columns in "left" to be true
    myargs = {'replace':'left'}
    myargs.update(kwargs)
    # Remove the replace key from the argument dict to be sent to
    # pandas merge command (compare strings with !=, not identity)
    kwargs = {k:v for k,v in myargs.items() if k != 'replace'}
    if myargs['replace'] is not None:
        # Generate a list of overlapping column names not associated with the join
        skipcols = set(flatten([v for k, v in myargs.items() if k in ['on','left_on','right_on']]))
        leftcols = set(left.columns)
        rightcols = set(right.columns)
        dropcols = list((leftcols & rightcols).difference(skipcols))
        # Remove the overlapping column names from the appropriate DataFrame
        if myargs['replace'].lower() == 'left':
            left = left.copy().drop(dropcols,axis=1)
        elif myargs['replace'].lower() == 'right':
            right = right.copy().drop(dropcols,axis=1)
    # Merge once the overlapping columns have been removed from one side
    df = pa.merge(left,right,**kwargs)
    return df
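The core idea can be shown in a much smaller form. This is a simplified sketch, not the gist's function: `merge_overwrite` is a hypothetical helper that only handles the `on` parameter and always lets the right-hand frame win.

```python
import pandas as pd

def merge_overwrite(left, right, on, how="left"):
    """Hypothetical helper: columns from `right` replace same-named columns in `left`."""
    keys = [on] if isinstance(on, str) else list(on)
    # Non-key columns present in both frames would otherwise get suffixes.
    overlap = [col for col in left.columns if col in right.columns and col not in keys]
    # Drop them from the left side so the right side's values are kept.
    return pd.merge(left.drop(columns=overlap), right, on=keys, how=how)

left = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
right = pd.DataFrame({"id": [1, 2, 3], "value": [1, 2, 3]})

result = merge_overwrite(left, right, on="id")
print(result)
```

Because the overlapping column is dropped from the left frame before merging, a left row with no match in the right frame ends up with NaN instead of keeping its old value. This replaces the column rather than filling gaps in it.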