Pandas merge on a common column and the index

Time: 2019-11-14 16:34:33

Tags: python pandas

I want to merge two large dataframes that look like this:

            loc  val
2019-09-01  0    23.2
2019-09-02  0    13.2
...
2019-11-01  0    12.9
2019-09-01  1    21.2
2019-09-01  1    26.7
...
2019-11-01  1    13.5
...
2019-09-01  4    23.4
...
2019-11-01  4    17.8

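In other words, the index holds many datetimes for each loc, and loc ranges from 0 to 4.

I have 2 of these dataframes. I want to join them on the loc column, but I also want the index to be taken into account, inner-join style. So if I have a second dataframe: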

            loc  val
2019-09-02  0    54.8
2019-09-03  0    11.7
...

then the merge would look something like:

            loc  val    val
2019-09-01  0    23.2   NaN
2019-09-02  0    13.2   54.8
...

Do you know if this is possible? I would like something like this (if possible):

df = pd.merge(df1, df2, on="loc", left_index=True, right_index=True)

I have been experimenting with merge, but I can't figure out how to do it. Thanks.

2 Answers:

Answer 0: (score: 3)

IIUC,

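We can rename the axis so that both indexes share a common name. (I tried merging on the unnamed index, but couldn't get it to work.)

Then we merge on your 'loc' column plus the newly named 'date' index.

It sounds like you already know merge, so adjust the behaviour (the how argument) to fit your needs.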

df.rename_axis('date',inplace=True)
df1.rename_axis('date',inplace=True)
pd.merge(df,df1,on=['loc','date'],how='left',indicator=True)
out:


           loc  val_x  val_y     _merge
date                                    
2019-09-01  0.0   23.2    NaN  left_only
2019-09-02  0.0   13.2   54.8       both
2019-11-01  0.0   12.9    NaN  left_only
2019-09-01  1.0   21.2    NaN  left_only
2019-09-01  1.0   26.7    NaN  left_only
2019-11-01  1.0   13.5    NaN  left_only
2019-09-01  4.0   23.4    NaN  left_only
2019-11-01  4.0   17.8    NaN  left_only
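
For reference, a minimal self-contained sketch of the approach above. The frames `left` and `right` and their values are made up for illustration (they are not the asker's real data), and it assumes a pandas version in which `on` may name index levels as well as columns (0.23 or later, as far as I know).

```python
import pandas as pd

# Hypothetical sample frames with an unnamed DatetimeIndex, shaped like the question's data
left = pd.DataFrame(
    {"loc": [0, 0, 1], "val": [23.2, 13.2, 21.2]},
    index=pd.to_datetime(["2019-09-01", "2019-09-02", "2019-09-01"]),
)
right = pd.DataFrame(
    {"loc": [0, 0], "val": [54.8, 11.7]},
    index=pd.to_datetime(["2019-09-02", "2019-09-03"]),
)

# Give both indexes the same name so merge can refer to them by that name
left = left.rename_axis("date")
right = right.rename_axis("date")

# 'loc' is a column and 'date' is an index level; both are used as join keys
merged = pd.merge(left, right, on=["loc", "date"], how="left", indicator=True)
print(merged)
```

Only the row for 2019-09-02 with loc 0 exists in both frames, so it is the only one that gets a `val_y` value; the other rows are marked `left_only`.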

Answer 1: (score: 2)

You can try the following:

df_1 = df_1.reset_index().rename(columns={'index':'dates'}) #Turns the index into a column, then renames that column to `dates`
df_2 = df_2.reset_index().rename(columns={'index':'dates'}) #Same as the first line

df_output = df_1.merge(df_2,how='inner',left_on=['loc','dates'],right_on=['loc','dates']) #Finally, perform the inner join on both columns.

This will produce the desired output. Here is a sample set to illustrate it better.

import pandas as pd
d_1 = {'index':['2019-09-02','2019-09-03'],'loc':[0,0],'val':[23.2,13.2]}
d_2 = {'index':['2019-09-02','2019-09-03','2019-09-05'],'loc':[0,0,0],'val':[54.8,10,13]}
df_1 = pd.DataFrame(d_1)
df_2 = pd.DataFrame(d_2)
df_1 = df_1.set_index('index') #This is your data
df_2 = df_2.set_index('index') #This is your data
print(df_1)
print(df_2)
df_1 = df_1.reset_index().rename(columns={'index':'dates'})
df_2 = df_2.reset_index().rename(columns={'index':'dates'})

final_df = df_2.merge(df_1,how='inner',left_on=['dates','loc'],right_on=['dates','loc'])
print(final_df)

Here is the output:

        dates  loc  val_x  val_y
0  2019-09-02    0   54.8   23.2
1  2019-09-03    0   10.0   13.2

However:

For your expected output and the information given, a left join will meet the requirement more easily. With this data:

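d_1 = {'index':['2019-09-01','2019-09-02'],'loc':[0,0],'val':[23.2,13.2]}
d_2 = {'index':['2019-09-02','2019-09-03'],'loc':[0,0],'val':[54.8,11.7]}
#Rebuild the frames from the new data; 'index' is already a column here, so only the rename is needed
df_1 = pd.DataFrame(d_1).rename(columns={'index':'dates'})
df_2 = pd.DataFrame(d_2).rename(columns={'index':'dates'})
final_df = df_2.merge(df_1,how='left',left_on=['dates','loc'],right_on=['dates','loc'])
print(final_df)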

Output:

        dates  loc  val_x  val_y
0  2019-09-02    0   54.8   13.2
1  2019-09-03    0   11.7    NaN
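
Note that the left join above keeps the rows of df_2. To get the layout shown in the question (every row of the first frame, with NaN where the second frame has no match), the first frame probably needs to be on the left. A sketch under that assumption, using the same made-up values; the suffixes `_1`/`_2` and the final `set_index` are my own additions, not something the question asked for:

```python
import pandas as pd

# Hypothetical sample frames with an unnamed DatetimeIndex, as in the question
df_1 = pd.DataFrame(
    {"loc": [0, 0], "val": [23.2, 13.2]},
    index=pd.to_datetime(["2019-09-01", "2019-09-02"]),
)
df_2 = pd.DataFrame(
    {"loc": [0, 0], "val": [54.8, 11.7]},
    index=pd.to_datetime(["2019-09-02", "2019-09-03"]),
)

# reset_index() turns an unnamed index into a column called 'index'
left = df_1.reset_index().rename(columns={"index": "dates"})
right = df_2.reset_index().rename(columns={"index": "dates"})

# Left join keeps every row of df_1; then put the dates back as the index
result = (
    left.merge(right, on=["dates", "loc"], how="left", suffixes=("_1", "_2"))
        .set_index("dates")
)
print(result)
```

Switching `how="left"` to `how="outer"` would also keep the rows that appear only in the second frame, if that is what you need.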