我有两个熊猫数据框,我想合并成一个。我希望将结果数据框沿非索引轴排序(在我的情况下为'seconds_since_start'
)。我想合并'seconds_since_start'
具有相同值的行。我还想保留两个数据框之间的唯一列。
显示给定的输入和所需的输出可能会更容易。
df_a = """
valid_a,value_a,seconds_since_start
2000-02-15 14:47:00,12.3,0.0
2000-02-15 15:59:00,20.6,30.0
2000-02-15 16:51:00,20.3,120.0
2000-02-15 17:52:00,22.6,200.0
"""
df_b = """
valid_b,value_b,seconds_since_start
2019-12-24 14:54:00,12.4,20.0
2019-12-24 15:54:00,18.7,30.0
2019-12-24 16:54:00,19.2,90.0
2019-12-24 17:54:00,20.8,250.0
"""
df_desired_output = """
valid_a,valid_b,value_a,value_b,seconds_since_start
2000-02-15 14:47:00,,12.3,,0.0
,2019-12-24 14:54:00,,12.4,20.0
2000-02-15 15:59:00,2019-12-24 15:54:00,20.6,18.7,30.0
,2019-12-24 16:54:00,,19.2,90.0
2000-02-15 16:51:00,,20.3,,120.0
2000-02-15 17:52:00,,22.6,,200.0
,2019-12-24 17:54:00,,20.8,250.0
"""
from io import StringIO
import pandas as pd
import numpy as np
df_a = StringIO(df_a)
df_a = pd.read_csv(df_a)
df_a['valid_a'] = pd.to_datetime(df_a['valid_a']) # convert 'valid' column to pd.datetime objects
df_a = df_a.set_index('valid_a') # set the 'valid' as index
df_b = StringIO(df_b)
df_b = pd.read_csv(df_b)
df_b['valid_b'] = pd.to_datetime(df_b['valid_b']) # convert 'valid' column to pd.datetime objects
df_b = df_b.set_index('valid_b') # set the 'valid' as index
df_desired_output = StringIO(df_desired_output)
df_desired_output = pd.read_csv(df_desired_output)
print('input dataframe A\n', df_a)
print('input dataframe B\n', df_b)
print('desired output dataframe\n', df_desired_output)
df_new = pd.concat([df_a, df_b], sort=False) # can't sort by 'seconds_since_start' from here so I do it on the next line
df_new = df_new.sort_values(by='seconds_since_start') # sort
print('actual output\n', df_new) # fails to merge rows that have the same value for 'seconds_since_start'
输出
input dataframe A
value_a seconds_since_start
valid_a
2000-02-15 14:47:00 12.3 0.0
2000-02-15 15:59:00 20.6 30.0
2000-02-15 16:51:00 20.3 120.0
2000-02-15 17:52:00 22.6 200.0
input dataframe B
value_b seconds_since_start
valid_b
2019-12-24 14:54:00 12.4 20.0
2019-12-24 15:54:00 18.7 30.0
2019-12-24 16:54:00 19.2 90.0
2019-12-24 17:54:00 20.8 250.0
desired output dataframe
valid_a valid_b ... value_b seconds_since_start
0 2000-02-15 14:47:00 NaN ... NaN 0.0
1 NaN 2019-12-24 14:54:00 ... 12.4 20.0
2 2000-02-15 15:59:00 2019-12-24 15:54:00 ... 18.7 30.0
3 NaN 2019-12-24 16:54:00 ... 19.2 90.0
4 2000-02-15 16:51:00 NaN ... NaN 120.0
5 2000-02-15 17:52:00 NaN ... NaN 200.0
6 NaN 2019-12-24 17:54:00 ... 20.8 250.0
[7 rows x 5 columns]
actual output
value_a seconds_since_start value_b
2000-02-15 14:47:00 12.3 0.0 NaN
2019-12-24 14:54:00 NaN 20.0 12.4
2000-02-15 15:59:00 20.6 30.0 NaN
2019-12-24 15:54:00 NaN 30.0 18.7
2019-12-24 16:54:00 NaN 90.0 19.2
2000-02-15 16:51:00 20.3 120.0 NaN
2000-02-15 17:52:00 22.6 200.0 NaN
2019-12-24 17:54:00 NaN 250.0 20.8
答案 0 :(得分:2)
这里是使用合并的示例。首先重置df_a和df_b中的索引,然后执行外部联接并对值进行排序:
df_a.reset_index().merge(df_b.reset_index(),
on=['seconds_since_start'],
how='outer').sort_values('seconds_since_start')
valid_a value_a seconds_since_start valid_b \
0 2000-02-15 14:47:00 12.3 0.0 NaT
4 NaT NaN 20.0 2019-12-24 14:54:00
1 2000-02-15 15:59:00 20.6 30.0 2019-12-24 15:54:00
5 NaT NaN 90.0 2019-12-24 16:54:00
2 2000-02-15 16:51:00 20.3 120.0 NaT
3 2000-02-15 17:52:00 22.6 200.0 NaT
6 NaT NaN 250.0 2019-12-24 17:54:00
value_b
0 NaN
4 12.4
1 18.7
5 19.2
2 NaN
3 NaN
6 20.8
答案 1 :(得分:1)
假设seconds_since_start
在df_a
和df_b
中是唯一的:
col = 'seconds_since_start'
s = pd.concat([df_a[col], df_b[col]]).sort_values().to_frame()
output = s.merge(df_a, on=col, how='left') \
.merge(df_b, on=col, how='left')
结果:
seconds_since_start valid_a value_a valid_b value_b
0 0.0 2000-02-15 14:47:00 12.3 NaN NaN
1 20.0 NaN NaN 2019-12-24 14:54:00 12.4
2 30.0 2000-02-15 15:59:00 20.6 2019-12-24 15:54:00 18.7
3 30.0 2000-02-15 15:59:00 20.6 2019-12-24 15:54:00 18.7
4 90.0 NaN NaN 2019-12-24 16:54:00 19.2
5 120.0 2000-02-15 16:51:00 20.3 NaN NaN
6 200.0 2000-02-15 17:52:00 22.6 NaN NaN
7 250.0 NaN NaN 2019-12-24 17:54:00 20.8
答案 2 :(得分:1)
只需将索引添加到列
df_new = pd.concat([df_a.assign(valid_a=df_a.index), df_b.assign(valid_b=df_b.index)], sort=False)
df_new = df_new.sort_values(by='seconds_since_start')
答案 3 :(得分:1)
它只是合并:
pd.merge(df_a.reset_index(),
df_b.reset_index(),
on='seconds_since_start',
how='outer')
输出:
valid_a value_a seconds_since_start valid_b value_b
-- ------------------- --------- --------------------- ------------------- ---------
0 2000-02-15 14:47:00 12.3 0 NaT nan
1 2000-02-15 15:59:00 20.6 30 2019-12-24 15:54:00 18.7
2 2000-02-15 16:51:00 20.3 120 NaT nan
3 2000-02-15 17:52:00 22.6 200 NaT nan
4 NaT nan 20 2019-12-24 14:54:00 12.4
5 NaT nan 90 2019-12-24 16:54:00 19.2
6 NaT nan 250 2019-12-24 17:54:00 20.8