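I am trying to merge multiple DataFrames that contain time series data. The DataFrames can have up to 100 columns and roughly 5000 rows. Two example DataFrames are:
import pandas as pd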
df1 = pd.DataFrame({'SubjectID': ['A', 'A', 'B', 'C'], 'Date': ['2010-05-08', '2010-05-10', '2010-05-08', '2010-05-08'], 'Test1':[1, 2, 3, 4], 'Gender': ['M', 'M', 'M', 'F'], 'StudyID': [1, 1, 1, 1]})
df2 = pd.DataFrame({'SubjectID': ['A', 'A', 'A', 'B', 'C'], 'Date': ['2010-05-08', '2010-05-09', '2010-05-10', '2010-05-08', '2010-05-09'], 'Test2': [1, 2, 3, 4, 5], 'Gender': ['M', 'M', 'M', 'M', 'F'], 'StudyID': [1, 1, 1, 1, 1]})
df1
SubjectID Date Test1 Gender StudyID
0 A 2010-05-08 1 M 1
1 A 2010-05-10 2 M 1
2 B 2010-05-08 3 M 1
3 C 2010-05-08 4 F 1
df2
SubjectID Date Test2 Gender StudyID
0 A 2010-05-08 1 M 1
1 A 2010-05-09 2 M 1
2 A 2010-05-10 3 M 1
3 B 2010-05-08 4 M 1
4 C 2010-05-09 5 F 1
My expected output is:
SubjectID Date Test1 Gender StudyID Test2
0 A 2010-05-08 1.0 M 1.0 1.0
1 A 2010-05-09 NaN M 1.0 2.0
2 A 2010-05-10 2.0 M 1.0 3.0
3 B 2010-05-08 3.0 M 1.0 4.0
4 C 2010-05-08 4.0 F 1.0 NaN
5 C 2010-05-09 NaN F 1.0 5.0
I am joining the DataFrames with:
merged_df = df1.set_index(['SubjectID', 'Date']).join(df2.set_index(['SubjectID', 'Date']), how = 'outer', lsuffix = '_l', rsuffix = '_r').reset_index()
but my output is:
SubjectID Date Test1 Gender_l StudyID_l Test2 Gender_r StudyID_r
0 A 2010-05-08 1.0 M 1.0 1.0 M 1.0
1 A 2010-05-09 NaN NaN NaN 2.0 M 1.0
2 A 2010-05-10 2.0 M 1.0 3.0 M 1.0
3 B 2010-05-08 3.0 M 1.0 4.0 M 1.0
4 C 2010-05-08 4.0 F 1.0 NaN NaN NaN
5 C 2010-05-09 NaN NaN NaN 5.0 F 1.0
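If all the values in the shared columns are equal across both DataFrames, is there a way to combine those columns while merging? I could do this after the merge, but that would become tedious for my large dataset.
Answer 0 (score: 1)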
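It depends on how you want to resolve information that may not match exactly. If you have merged several frames, I think taking the modal value is appropriate. Using your merged_df, we can resolve it with:
# Group columns by the name before the suffix, then take the row-wise mode of each group.
# Note: groupby(..., axis=1) is deprecated as of pandas 2.1.
merged_df = merged_df.groupby([x.split('_')[0] for x in merged_df.columns], axis=1).apply(lambda x: x.mode(axis=1)[0])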
Date Gender StudyID SubjectID Test1 Test2
0 2010-05-08 M 1.0 A 1.0 1.0
1 2010-05-09 M 1.0 A NaN 2.0
2 2010-05-10 M 1.0 A 2.0 3.0
3 2010-05-08 M 1.0 B 3.0 4.0
4 2010-05-08 F 1.0 C 4.0 NaN
5 2010-05-09 F 1.0 C NaN 5.0
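The row-wise modal resolution above can also be sketched without grouping along the column axis, which may help on pandas versions where `groupby(..., axis=1)` is deprecated. This is only a sketch under assumptions: `resolve_overlaps` is a hypothetical helper name, and it assumes the base column names contain no underscore of their own, so that everything before the first `_` identifies the column.

```python
import pandas as pd

df1 = pd.DataFrame({'SubjectID': ['A', 'A', 'B', 'C'],
                    'Date': ['2010-05-08', '2010-05-10', '2010-05-08', '2010-05-08'],
                    'Test1': [1, 2, 3, 4],
                    'Gender': ['M', 'M', 'M', 'F'],
                    'StudyID': [1, 1, 1, 1]})
df2 = pd.DataFrame({'SubjectID': ['A', 'A', 'A', 'B', 'C'],
                    'Date': ['2010-05-08', '2010-05-09', '2010-05-10', '2010-05-08', '2010-05-09'],
                    'Test2': [1, 2, 3, 4, 5],
                    'Gender': ['M', 'M', 'M', 'M', 'F'],
                    'StudyID': [1, 1, 1, 1, 1]})

keys = ['SubjectID', 'Date']
merged_df = (df1.set_index(keys)
                .join(df2.set_index(keys), how='outer', lsuffix='_l', rsuffix='_r')
                .reset_index())


def resolve_overlaps(df, sep='_'):
    """Hypothetical helper: collapse columns that share a base name into one column."""
    # Map each base name (text before the first separator) to its columns.
    groups = {}
    for col in df.columns:
        groups.setdefault(col.split(sep)[0], []).append(col)

    resolved = {}
    for base, cols in groups.items():
        if len(cols) == 1:
            # Nothing to reconcile for columns that appear only once.
            resolved[base] = df[cols[0]]
        else:
            # Row-wise mode ignores NaN; column 0 holds the first modal value.
            resolved[base] = df[cols].mode(axis=1)[0]
    return pd.DataFrame(resolved)


resolved_df = resolve_overlaps(merged_df)
print(resolved_df)
```

Unlike the `groupby` version, this keeps the columns in their original order instead of sorting them alphabetically.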
Or perhaps you want to give priority to the non-null values in the first frame, in which case .combine_first is the tool:
df1.set_index(['SubjectID', 'Date']).combine_first(df2.set_index(['SubjectID', 'Date']))
                      Gender  StudyID  Test1  Test2
SubjectID Date
A         2010-05-08       M      1.0    1.0    1.0
          2010-05-09       M      1.0    NaN    2.0
          2010-05-10       M      1.0    2.0    3.0
B         2010-05-08       M      1.0    3.0    4.0
C         2010-05-08       F      1.0    4.0    NaN
          2010-05-09       F      1.0    NaN    5.0
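To make the priority rule concrete, here is a small sketch with made-up data (the frames `first` and `second` and their values are hypothetical, not taken from the question). Where both frames have a value for the same key, the calling frame's value is kept. Where the calling frame has a null, the value from the other frame fills it in.

```python
import pandas as pd

keys = ['SubjectID', 'Date']

# Hypothetical frames: they disagree on Gender for subject A,
# and `first` is missing Gender for subject B.
first = pd.DataFrame({'SubjectID': ['A', 'B'],
                      'Date': ['2010-05-08', '2010-05-08'],
                      'Gender': ['M', None]})
second = pd.DataFrame({'SubjectID': ['A', 'B', 'C'],
                       'Date': ['2010-05-08', '2010-05-08', '2010-05-09'],
                       'Gender': ['F', 'M', 'F']})

# Non-null values in `first` win; gaps and missing rows come from `second`.
combined = (first.set_index(keys)
                 .combine_first(second.set_index(keys))
                 .reset_index())  # back to a flat frame with key columns
print(combined)
```

Subject A keeps `'M'` from `first` even though `second` says `'F'`, so this approach silently hides disagreements instead of flagging them.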
If you have to merge many DataFrames, it is best to use reduce from functools.
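from functools import reduce
# Fold the list into one frame by outer-merging each DataFrame onto the running result.
# Caution: reusing the same suffixes in every merge can create duplicate column names, which newer pandas versions may reject.
merged_df = reduce(lambda l, r: l.merge(r, on=['SubjectID', 'Date'], how='outer', suffixes=['_l', '_r']),
                   [df1, df2, df1, df2, df2])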
You will end up with many overlapping columns, but they can still be resolved:
merged_df.groupby([x.split('_')[0] for x in merged_df.columns], axis=1).apply(lambda x: x.mode(axis=1)[0])
Date Gender StudyID SubjectID Test1 Test2
0 2010-05-08 M 1.0 A 1.0 1.0
1 2010-05-10 M 1.0 A 2.0 3.0
2 2010-05-08 M 1.0 B 3.0 4.0
3 2010-05-08 F 1.0 C 4.0 NaN
4 2010-05-09 M 1.0 A NaN 2.0
5 2010-05-09 F 1.0 C NaN 5.0
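When the same pair of suffixes is reused across many merges, the suffixed names can collide with each other. One way around this, offered as a sketch and not as the answer's own method, is to tag each frame's non-key columns with its position in the list before joining, then collapse the tagged columns by row-wise mode. The names `tagged`, `wide`, and `groups` are illustrative, and the sketch assumes the key pair `SubjectID`/`Date` is unique within each frame.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({'SubjectID': ['A', 'A', 'B', 'C'],
                    'Date': ['2010-05-08', '2010-05-10', '2010-05-08', '2010-05-08'],
                    'Test1': [1, 2, 3, 4],
                    'Gender': ['M', 'M', 'M', 'F'],
                    'StudyID': [1, 1, 1, 1]})
df2 = pd.DataFrame({'SubjectID': ['A', 'A', 'A', 'B', 'C'],
                    'Date': ['2010-05-08', '2010-05-09', '2010-05-10', '2010-05-08', '2010-05-09'],
                    'Test2': [1, 2, 3, 4, 5],
                    'Gender': ['M', 'M', 'M', 'M', 'F'],
                    'StudyID': [1, 1, 1, 1, 1]})

keys = ['SubjectID', 'Date']
frames = [df1, df2, df1, df2, df2]

# Tag every non-key column with the frame's position, e.g. Gender -> Gender_0,
# so no two frames ever contribute the same column name.
tagged = [f.set_index(keys).add_suffix(f'_{i}') for i, f in enumerate(frames)]

# Outer-join all frames on the shared (SubjectID, Date) index.
wide = reduce(lambda left, right: left.join(right, how='outer'), tagged)

# Group the tagged columns by base name (everything before the last underscore).
groups = {}
for col in wide.columns:
    groups.setdefault(col.rsplit('_', 1)[0], []).append(col)

# Row-wise mode per group; NaN is ignored, column 0 is the first modal value.
result = pd.DataFrame(
    {base: wide[cols].mode(axis=1)[0] for base, cols in groups.items()}
).reset_index()
print(result)
```

Because the tag is split off with `rsplit`, base column names that already contain an underscore are still grouped correctly.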