pandas - Concatenate two DataFrames along a non-index column, merging rows that share the same value in that column

Time: 2019-12-30 19:12:33

Tags: python pandas

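I have two pandas DataFrames that I want to combine into one. The resulting DataFrame should be sorted by a non-index column ('seconds_since_start' in my case). Rows that have the same 'seconds_since_start' value should be merged into a single row, and the columns that are unique to each DataFrame should be kept.

It is probably easier to show the given input and the desired output.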


df_a = """
valid_a,value_a,seconds_since_start
2000-02-15 14:47:00,12.3,0.0
2000-02-15 15:59:00,20.6,30.0
2000-02-15 16:51:00,20.3,120.0
2000-02-15 17:52:00,22.6,200.0
"""

df_b = """
valid_b,value_b,seconds_since_start
2019-12-24 14:54:00,12.4,20.0
2019-12-24 15:54:00,18.7,30.0
2019-12-24 16:54:00,19.2,90.0
2019-12-24 17:54:00,20.8,250.0
"""

df_desired_output = """
valid_a,valid_b,value_a,value_b,seconds_since_start
2000-02-15 14:47:00,,12.3,,0.0
,2019-12-24 14:54:00,,12.4,20.0
2000-02-15 15:59:00,2019-12-24 15:54:00,20.6,18.7,30.0
,2019-12-24 16:54:00,,19.2,90.0
2000-02-15 16:51:00,,20.3,,120.0
2000-02-15 17:52:00,,22.6,,200.0
,2019-12-24 17:54:00,,20.8,250.0
"""

from io import StringIO
import pandas as pd
import numpy as np

df_a = StringIO(df_a)
df_a = pd.read_csv(df_a)
df_a['valid_a'] = pd.to_datetime(df_a['valid_a'])  # convert 'valid' column to pd.datetime objects
df_a = df_a.set_index('valid_a')  # set the 'valid' as index

df_b = StringIO(df_b)
df_b = pd.read_csv(df_b)
df_b['valid_b'] = pd.to_datetime(df_b['valid_b'])  # convert 'valid' column to pd.datetime objects
df_b = df_b.set_index('valid_b')  # set the 'valid' as index

df_desired_output = StringIO(df_desired_output)
df_desired_output = pd.read_csv(df_desired_output)


print('input dataframe A\n', df_a)
print('input dataframe B\n', df_b)
print('desired output dataframe\n', df_desired_output)

df_new = pd.concat([df_a, df_b], sort=False)  # can't sort by 'seconds_since_start' from here so I do it on the next line
df_new = df_new.sort_values(by='seconds_since_start')  # sort
print('actual output\n', df_new)  # fails to merge rows that have the same value for 'seconds_since_start'

Output

input dataframe A
                      value_a  seconds_since_start
valid_a                                          
2000-02-15 14:47:00     12.3                  0.0
2000-02-15 15:59:00     20.6                 30.0
2000-02-15 16:51:00     20.3                120.0
2000-02-15 17:52:00     22.6                200.0
input dataframe B
                      value_b  seconds_since_start
valid_b                                          
2019-12-24 14:54:00     12.4                 20.0
2019-12-24 15:54:00     18.7                 30.0
2019-12-24 16:54:00     19.2                 90.0
2019-12-24 17:54:00     20.8                250.0
desired output dataframe
                valid_a              valid_b  ...  value_b  seconds_since_start
0  2000-02-15 14:47:00                  NaN  ...      NaN                  0.0
1                  NaN  2019-12-24 14:54:00  ...     12.4                 20.0
2  2000-02-15 15:59:00  2019-12-24 15:54:00  ...     18.7                 30.0
3                  NaN  2019-12-24 16:54:00  ...     19.2                 90.0
4  2000-02-15 16:51:00                  NaN  ...      NaN                120.0
5  2000-02-15 17:52:00                  NaN  ...      NaN                200.0
6                  NaN  2019-12-24 17:54:00  ...     20.8                250.0

[7 rows x 5 columns]
actual output
                      value_a  seconds_since_start  value_b
2000-02-15 14:47:00     12.3                  0.0      NaN
2019-12-24 14:54:00      NaN                 20.0     12.4
2000-02-15 15:59:00     20.6                 30.0      NaN
2019-12-24 15:54:00      NaN                 30.0     18.7
2019-12-24 16:54:00      NaN                 90.0     19.2
2000-02-15 16:51:00     20.3                120.0      NaN
2000-02-15 17:52:00     22.6                200.0      NaN
2019-12-24 17:54:00      NaN                250.0     20.8

4 answers:

Answer 0: (score: 2)

Here is an example using merge. First reset the index of df_a and df_b, then perform an outer join and sort the values:

df_a.reset_index().merge(df_b.reset_index(),
                         on=['seconds_since_start'],
                         how='outer').sort_values('seconds_since_start')

              valid_a  value_a  seconds_since_start             valid_b  \
0 2000-02-15 14:47:00     12.3                  0.0                 NaT   
4                 NaT      NaN                 20.0 2019-12-24 14:54:00   
1 2000-02-15 15:59:00     20.6                 30.0 2019-12-24 15:54:00   
5                 NaT      NaN                 90.0 2019-12-24 16:54:00   
2 2000-02-15 16:51:00     20.3                120.0                 NaT   
3 2000-02-15 17:52:00     22.6                200.0                 NaT   
6                 NaT      NaN                250.0 2019-12-24 17:54:00   

   value_b  
0      NaN  
4     12.4  
1     18.7  
5     19.2  
2      NaN  
3      NaN  
6     20.8  
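
For reference, here is a self-contained sketch of the same outer-join approach that also renumbers the rows and reorders the columns to match the desired output. The frames are rebuilt inline from the question's data; the names `csv_a`, `csv_b` and `merged` are only illustrative.

```python
from io import StringIO

import pandas as pd

csv_a = """valid_a,value_a,seconds_since_start
2000-02-15 14:47:00,12.3,0.0
2000-02-15 15:59:00,20.6,30.0
2000-02-15 16:51:00,20.3,120.0
2000-02-15 17:52:00,22.6,200.0
"""
csv_b = """valid_b,value_b,seconds_since_start
2019-12-24 14:54:00,12.4,20.0
2019-12-24 15:54:00,18.7,30.0
2019-12-24 16:54:00,19.2,90.0
2019-12-24 17:54:00,20.8,250.0
"""

# Rebuild the question's frames: timestamps parsed and used as the index.
df_a = pd.read_csv(StringIO(csv_a), parse_dates=["valid_a"]).set_index("valid_a")
df_b = pd.read_csv(StringIO(csv_b), parse_dates=["valid_b"]).set_index("valid_b")

merged = (
    df_a.reset_index()                                   # valid_a becomes a column again
    .merge(df_b.reset_index(), on="seconds_since_start", how="outer")
    .sort_values("seconds_since_start")
    .reset_index(drop=True)                              # row labels 0..n-1 after sorting
)

# Reorder the columns to match the desired output.
merged = merged[["valid_a", "valid_b", "value_a", "value_b", "seconds_since_start"]]
print(merged)
```

Resetting the index before merging matters because a merge on columns ignores the index of both frames, so the timestamps would otherwise be lost.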

Answer 1: (score: 1)

Assuming seconds_since_start is unique within each of df_a and df_b:

col = 'seconds_since_start'

# valid_a / valid_b must be ordinary columns here (not the index),
# because a merge on columns ignores the index of both frames.
# drop_duplicates() keeps each key once; without it, a key present in
# both frames (30.0 here) produces a duplicated row.
s = pd.concat([df_a[col], df_b[col]]).drop_duplicates().sort_values().to_frame()
output = s.merge(df_a, on=col, how='left') \
          .merge(df_b, on=col, how='left')

Result:

   seconds_since_start              valid_a  value_a              valid_b  value_b
0                  0.0  2000-02-15 14:47:00     12.3                  NaN      NaN
1                 20.0                  NaN      NaN  2019-12-24 14:54:00     12.4
2                 30.0  2000-02-15 15:59:00     20.6  2019-12-24 15:54:00     18.7
3                 90.0                  NaN      NaN  2019-12-24 16:54:00     19.2
4                120.0  2000-02-15 16:51:00     20.3                  NaN      NaN
5                200.0  2000-02-15 17:52:00     22.6                  NaN      NaN
6                250.0                  NaN      NaN  2019-12-24 17:54:00     20.8
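
The uniqueness assumption stated at the top of this answer can be checked rather than taken on trust. The sketch below uses small hypothetical frames (`left`, `right`, `right_dup`) and shows two options: testing `Series.is_unique` directly, or passing `validate="one_to_one"` so that merge itself raises when a key repeats on either side.

```python
import pandas as pd

col = "seconds_since_start"

# Hypothetical stand-ins for df_a and df_b, with the timestamps omitted.
left = pd.DataFrame({col: [0.0, 30.0, 120.0], "value_a": [12.3, 20.6, 20.3]})
right = pd.DataFrame({col: [20.0, 30.0, 90.0], "value_b": [12.4, 18.7, 19.2]})

# Option 1: an explicit check of the assumption.
keys_unique = left[col].is_unique and right[col].is_unique

# Option 2: let merge enforce it; a repeated key raises MergeError.
merged = left.merge(right, on=col, how="outer", validate="one_to_one").sort_values(col)

# A frame with a repeated key (30.0 appears twice) is rejected.
right_dup = pd.concat([right, right.iloc[[1]]], ignore_index=True)
try:
    left.merge(right_dup, on=col, how="outer", validate="one_to_one")
    rejected = False
except pd.errors.MergeError:
    rejected = True

print(keys_unique, len(merged), rejected)
```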

Answer 2: (score: 1)

Simply add the index as a column:

df_new = pd.concat([df_a.assign(valid_a=df_a.index), df_b.assign(valid_b=df_b.index)], sort=False)  
df_new = df_new.sort_values(by='seconds_since_start')
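
Note that the concat above keeps the timestamps but still leaves two separate rows for 30.0. One possible way to collapse them, sketched here as an assumption rather than as part of the original answer, is to group on 'seconds_since_start' and take the first non-null value of each column. The names `stacked` and `collapsed` are illustrative.

```python
from io import StringIO

import pandas as pd

csv_a = """valid_a,value_a,seconds_since_start
2000-02-15 14:47:00,12.3,0.0
2000-02-15 15:59:00,20.6,30.0
2000-02-15 16:51:00,20.3,120.0
2000-02-15 17:52:00,22.6,200.0
"""
csv_b = """valid_b,value_b,seconds_since_start
2019-12-24 14:54:00,12.4,20.0
2019-12-24 15:54:00,18.7,30.0
2019-12-24 16:54:00,19.2,90.0
2019-12-24 17:54:00,20.8,250.0
"""
df_a = pd.read_csv(StringIO(csv_a), parse_dates=["valid_a"]).set_index("valid_a")
df_b = pd.read_csv(StringIO(csv_b), parse_dates=["valid_b"]).set_index("valid_b")

# Same concat as in the answer: copy each index into a column, then stack.
stacked = pd.concat(
    [df_a.assign(valid_a=df_a.index), df_b.assign(valid_b=df_b.index)],
    sort=False,
)

# One row per key; first() takes the first non-null value in each column.
# The groups come back sorted by key, so no separate sort_values is needed.
collapsed = stacked.groupby("seconds_since_start", as_index=False).first()
print(collapsed)
```

This only reproduces a merge when each key occurs at most once per frame; with repeated keys, first() silently discards the later rows.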

Answer 3: (score: 1)

It is just a merge:

pd.merge(df_a.reset_index(), 
         df_b.reset_index(), 
         on='seconds_since_start', 
         how='outer')

Output:

    valid_a                value_a    seconds_since_start  valid_b                value_b
--  -------------------  ---------  ---------------------  -------------------  ---------
 0  2000-02-15 14:47:00       12.3                      0  NaT                      nan
 1  2000-02-15 15:59:00       20.6                     30  2019-12-24 15:54:00       18.7
 2  2000-02-15 16:51:00       20.3                    120  NaT                      nan
 3  2000-02-15 17:52:00       22.6                    200  NaT                      nan
 4  NaT                      nan                       20  2019-12-24 14:54:00       12.4
 5  NaT                      nan                       90  2019-12-24 16:54:00       19.2
 6  NaT                      nan                      250  2019-12-24 17:54:00       20.8
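
To see which rows were actually merged and which came from only one frame, merge accepts `indicator=True`, which adds a `_merge` column. The sketch below uses two small hypothetical frames (`left`, `right`) and also sorts the result by key, since the outer join alone does not guarantee the order asked for in the question.

```python
import pandas as pd

# Hypothetical two-row stand-ins for df_a.reset_index() and df_b.reset_index().
left = pd.DataFrame({"seconds_since_start": [0.0, 30.0], "value_a": [12.3, 20.6]})
right = pd.DataFrame({"seconds_since_start": [20.0, 30.0], "value_b": [12.4, 18.7]})

merged = pd.merge(left, right, on="seconds_since_start", how="outer", indicator=True)
merged = merged.sort_values("seconds_since_start").reset_index(drop=True)

# _merge is categorical: 'left_only', 'right_only' or 'both'.
origin = merged["_merge"].astype(str).tolist()
print(merged)
```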