Joining time-series DataFrames where duplicate columns contain the same values

Time: 2019-01-18 16:14:41

Tags: python-3.x pandas join time-series

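I am trying to merge multiple DataFrames containing time-series data. These DataFrames can have up to 100 columns and about 5,000 rows. Two example DataFrames are: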

import pandas as pd

df1 = pd.DataFrame({'SubjectID': ['A', 'A', 'B', 'C'], 'Date': ['2010-05-08', '2010-05-10', '2010-05-08', '2010-05-08'], 'Test1': [1, 2, 3, 4], 'Gender': ['M', 'M', 'M', 'F'], 'StudyID': [1, 1, 1, 1]})

df2 = pd.DataFrame({'SubjectID': ['A', 'A', 'A', 'B', 'C'], 'Date': ['2010-05-08', '2010-05-09', '2010-05-10', '2010-05-08', '2010-05-09'], 'Test2': [1, 2, 3, 4, 5], 'Gender': ['M', 'M', 'M', 'M', 'F'], 'StudyID': [1, 1, 1, 1, 1]})

df1
  SubjectID        Date  Test1  Gender  StudyID
0         A  2010-05-08      1       M        1
1         A  2010-05-10      2       M        1
2         B  2010-05-08      3       M        1
3         C  2010-05-08      4       F        1

df2
  SubjectID        Date  Test2  Gender  StudyID
0         A  2010-05-08      1       M        1
1         A  2010-05-09      2       M        1
2         A  2010-05-10      3       M        1
3         B  2010-05-08      4       M        1
4         C  2010-05-09      5       F        1

My expected output is:

  SubjectID        Date  Test1  Gender  StudyID  Test2
0         A  2010-05-08    1.0       M      1.0    1.0
1         A  2010-05-09    NaN       M      1.0    2.0
2         A  2010-05-10    2.0       M      1.0    3.0
3         B  2010-05-08    3.0       M      1.0    4.0
4         C  2010-05-08    4.0       F      1.0    NaN
5         C  2010-05-09    NaN       F      1.0    5.0

I join the DataFrames with:

merged_df = df1.set_index(['SubjectID', 'Date']).join(df2.set_index(['SubjectID', 'Date']), how = 'outer', lsuffix = '_l', rsuffix = '_r').reset_index()

But my output is:

  SubjectID        Date  Test1  Gender_l  StudyID_l  Test2  Gender_r  StudyID_r
0         A  2010-05-08    1.0         M        1.0    1.0         M        1.0
1         A  2010-05-09    NaN       NaN        NaN    2.0         M        1.0
2         A  2010-05-10    2.0         M        1.0    3.0         M        1.0
3         B  2010-05-08    3.0         M        1.0    4.0         M        1.0
4         C  2010-05-08    4.0         F        1.0    NaN       NaN        NaN
5         C  2010-05-09    NaN       NaN        NaN    5.0         F        1.0

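If all the values in the duplicated columns of the two DataFrames are equal, is there a way to combine those columns while merging the DataFrames? I could do this after the merge, but that would become tedious for my large dataset.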

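1 Answer:

Answer 0 (score: 1)

It depends on how you want to implement the logic for resolving information that may not match exactly. If you have merged several frames, I think taking the modal value is appropriate. With your merged_df, we can resolve it as follows: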

merged_df = merged_df.groupby([x.split('_')[0] for x in merged_df.columns], axis=1).apply(lambda x: x.mode(axis=1)[0])

         Date Gender  StudyID SubjectID  Test1  Test2
0  2010-05-08      M      1.0         A    1.0    1.0
1  2010-05-09      M      1.0         A    NaN    2.0
2  2010-05-10      M      1.0         A    2.0    3.0
3  2010-05-08      M      1.0         B    3.0    4.0
4  2010-05-08      F      1.0         C    4.0    NaN
5  2010-05-09      F      1.0         C    NaN    5.0
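
A version-tolerant sketch of the same modal resolution, assuming an installed pandas in which grouping along columns is deprecated or unavailable. The helper names `resolve_by_mode` and `row_mode` are hypothetical, not part of pandas. Instead of a column-wise groupby, the sketch loops over the base column names (the text before the first `_`) and takes the row-wise mode within each group:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'SubjectID': ['A', 'A', 'B', 'C'],
                    'Date': ['2010-05-08', '2010-05-10', '2010-05-08', '2010-05-08'],
                    'Test1': [1, 2, 3, 4],
                    'Gender': ['M', 'M', 'M', 'F'],
                    'StudyID': [1, 1, 1, 1]})
df2 = pd.DataFrame({'SubjectID': ['A', 'A', 'A', 'B', 'C'],
                    'Date': ['2010-05-08', '2010-05-09', '2010-05-10', '2010-05-08', '2010-05-09'],
                    'Test2': [1, 2, 3, 4, 5],
                    'Gender': ['M', 'M', 'M', 'M', 'F'],
                    'StudyID': [1, 1, 1, 1, 1]})

keys = ['SubjectID', 'Date']
merged_df = df1.merge(df2, on=keys, how='outer', suffixes=('_l', '_r'))


def row_mode(row):
    """Return the first modal value of a row, or NaN if the row is all-NaN."""
    modes = row.mode()  # NaN values are dropped by default
    return modes.iloc[0] if not modes.empty else np.nan


def resolve_by_mode(frame):
    """Collapse columns that share a base name into one column of row-wise modes."""
    base_names = [str(col).split('_')[0] for col in frame.columns]
    resolved = {}
    for base in dict.fromkeys(base_names):  # unique names, first-appearance order
        positions = [i for i, name in enumerate(base_names) if name == base]
        resolved[base] = frame.iloc[:, positions].apply(row_mode, axis=1)
    return pd.DataFrame(resolved)


resolved_df = resolve_by_mode(merged_df)
print(resolved_df)
```

Unlike the groupby version, this keeps the columns in their order of first appearance rather than sorting them. As with the original one-liner, splitting on `_` assumes that no genuine column name contains an underscore.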

Or perhaps you want to give priority to the non-null values in the first frame, in which case it is .combine_first:

df1.set_index(['SubjectID', 'Date']).combine_first(df2.set_index(['SubjectID', 'Date']))

                     Gender  StudyID  Test1  Test2
SubjectID Date                                    
A         2010-05-08      M      1.0    1.0    1.0
          2010-05-09      M      1.0    NaN    2.0
          2010-05-10      M      1.0    2.0    3.0
B         2010-05-08      M      1.0    3.0    4.0
C         2010-05-08      F      1.0    4.0    NaN
          2010-05-09      F      1.0    NaN    5.0
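
Because .combine_first prefers the first frame wherever both frames hold a value, it can be worth checking beforehand that the shared columns really agree. The following is a minimal sketch of such a check on the example frames, followed by the combine itself. The variable names (`shared_cols`, `shared_rows`, `conflicts`) are illustrative only:

```python
import pandas as pd

df1 = pd.DataFrame({'SubjectID': ['A', 'A', 'B', 'C'],
                    'Date': ['2010-05-08', '2010-05-10', '2010-05-08', '2010-05-08'],
                    'Test1': [1, 2, 3, 4],
                    'Gender': ['M', 'M', 'M', 'F'],
                    'StudyID': [1, 1, 1, 1]})
df2 = pd.DataFrame({'SubjectID': ['A', 'A', 'A', 'B', 'C'],
                    'Date': ['2010-05-08', '2010-05-09', '2010-05-10', '2010-05-08', '2010-05-09'],
                    'Test2': [1, 2, 3, 4, 5],
                    'Gender': ['M', 'M', 'M', 'M', 'F'],
                    'StudyID': [1, 1, 1, 1, 1]})

keys = ['SubjectID', 'Date']
left = df1.set_index(keys)
right = df2.set_index(keys)

# Columns and (SubjectID, Date) keys that appear in both frames
shared_cols = left.columns.intersection(right.columns)
shared_rows = left.index.intersection(right.index)

# Restrict both frames to the same labels so they compare cell by cell
left_shared = left.reindex(index=shared_rows, columns=shared_cols)
right_shared = right.reindex(index=shared_rows, columns=shared_cols)

# A conflict is a cell where both frames have a value and the values differ
conflicts = left_shared.ne(right_shared) & left_shared.notna() & right_shared.notna()
n_conflicts = int(conflicts.to_numpy().sum())
print("conflicting cells:", n_conflicts)

# Combine, then move the keys back into ordinary columns
combined = left.combine_first(right).reset_index()
print(combined)
```

If the conflict count is nonzero, the modal approach or a manual review of the conflicting cells may be more appropriate than silently preferring one frame.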

If you have to merge many DataFrames, it is best to use reduce from functools:

from functools import reduce

merged_df = reduce(lambda l,r: l.merge(r, on=['SubjectID', 'Date'], how='outer', suffixes=['_l', '_r']), 
                   [df1, df2, df1, df2, df2])

You will have many overlapping columns, but you can still resolve them:

merged_df.groupby([x.split('_')[0] for x in merged_df.columns], axis=1).apply(lambda x: x.mode(axis=1)[0])

         Date Gender  StudyID SubjectID  Test1  Test2
0  2010-05-08      M      1.0         A    1.0    1.0
1  2010-05-10      M      1.0         A    2.0    3.0
2  2010-05-08      M      1.0         B    3.0    4.0
3  2010-05-08      F      1.0         C    4.0    NaN
4  2010-05-09      M      1.0         A    NaN    2.0
5  2010-05-09      F      1.0         C    NaN    5.0
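
Reusing the same `_l`/`_r` suffixes across repeated merges can produce colliding column labels, which some pandas versions may reject. As an alternative sketch (a swapped-in technique, not the merge-then-resolve method shown above), .combine_first can be folded over the key-indexed frames with reduce, so suffixed duplicate columns never appear. It assumes, as in the question, that overlapping columns hold the same values wherever both frames have data:

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({'SubjectID': ['A', 'A', 'B', 'C'],
                    'Date': ['2010-05-08', '2010-05-10', '2010-05-08', '2010-05-08'],
                    'Test1': [1, 2, 3, 4],
                    'Gender': ['M', 'M', 'M', 'F'],
                    'StudyID': [1, 1, 1, 1]})
df2 = pd.DataFrame({'SubjectID': ['A', 'A', 'A', 'B', 'C'],
                    'Date': ['2010-05-08', '2010-05-09', '2010-05-10', '2010-05-08', '2010-05-09'],
                    'Test2': [1, 2, 3, 4, 5],
                    'Gender': ['M', 'M', 'M', 'M', 'F'],
                    'StudyID': [1, 1, 1, 1, 1]})

keys = ['SubjectID', 'Date']
frames = [df1, df2, df1, df2, df2]

# Index every frame by the join keys so combine_first aligns on them
indexed = [frame.set_index(keys) for frame in frames]

# Earlier frames take priority; later frames only fill missing cells
merged_df = reduce(lambda left, right: left.combine_first(right), indexed).reset_index()
print(merged_df)
```

The result has one row per (SubjectID, Date) pair and one column per distinct column name, with no post-merge resolution step.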