Basically, I have two datasets, Global and Local, each with about 65,000 rows. I am trying to merge the two datasets on the primary key from the Global data.
Global:
Primary key  Name
234          ABC ltd
344          GHF ltd
566          THD ltd
677          FGG ltd
4666         JKD ltd
Local:
Primary key  Country  Status     Date
234          USA      Completed  1/8/2018
234          CAN      Pending    3/5/2019
344          USA      Pending    8/8/2019
344          CAN      Completed  6/5/2018
566          USA      Pending    3/5/2019
566          CAN      Completed  8/8/2019
677          USA      Pending    8/8/2019
4666         USA      Completed  1/8/2018
4666         CAN      Completed  1/8/2018
Merge:
Primary key  Name     USA Status  USA Date  CAN Status  CAN Date
234          ABC ltd  Completed   1/8/2018  Pending     3/5/2019
344          GHF ltd  Pending     8/8/2019  Completed   6/5/2018
566          THD ltd  Pending     3/5/2019  Completed   8/8/2019
677          FGG ltd  Pending     8/8/2019  -           -
4666         JKD ltd  Completed   1/8/2018  Completed   1/8/2018
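I have shown only two countries here, but I am actually working with four regions.
To do this, I split the Local dataset into two separate dataframes and then merged each of them with the Global data. The code I have so far is below.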
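import pandas as pd

# Load both datasets (the paths are placeholders)
Global = pd.read_excel("path to upload global data")
Local = pd.read_excel("path to upload local data")

# Split Local by country; drop "Country" so the merges below do not create Country_x / Country_y
df1 = Local.loc[Local['Country'] == "USA"].drop(columns="Country")
df2 = Local.loc[Local['Country'] == "CAN"].drop(columns="Country")
usa = df1.rename(columns={"Status": "USA Status", "Date": "USA Date"})
can = df2.rename(columns={"Status": "CAN Status", "Date": "CAN Date"})

# Left joins keep every row of Global; the key must match the column header exactly ("Primary key")
r1 = pd.merge(Global, usa, on="Primary key", how="left")
result = pd.merge(r1, can, on="Primary key", how="left")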
This looks tidy with only two countries, but with more regions it slows down processing and clutters the code.
Answer 0 (score: 1)
You can do the following:
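# Judging by the output, df1 is the Global data (key, name) and df2 is the Local data (key, country, status, date)
df = df1.merge(df2, on='key')
# Move country into the columns, giving (field, country) column pairs
df = df.set_index(['key', 'name', 'country']).unstack('country')
# Reorder the columns so that each country's fields sit together
df = df[sorted(df.columns, key=lambda x: x[1])]
print(df)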
                 status      date     status      date
country             CAN       CAN        USA       USA
key  name
234  ABC ltd    Pending  3/5/2019  Completed  1/8/2018
344  GHF ltd  Completed  6/5/2018    Pending  8/8/2019
566  THD ltd  Completed  8/8/2019    Pending  3/5/2019
677  FGG ltd        NaN       NaN    Pending  8/8/2019
4666 JKD ltd  Completed  1/8/2018  Completed  1/8/2018
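
For reference, here is a self-contained sketch of the same unstack idea, run on the sample data from the question. It assumes each (Primary key, Country) pair appears at most once in Local, since unstack raises an error on duplicate index entries. The names global_df, local_df and wide are only illustrative. The sketch flattens the two-level columns into names such as "USA Status" and uses a left merge so that Global rows with no Local match are kept. Nothing in it is specific to two countries, so it should carry over to four regions without extra code.

```python
import pandas as pd

# Sample data copied from the question (illustrative variable names).
global_df = pd.DataFrame({
    "Primary key": [234, 344, 566, 677, 4666],
    "Name": ["ABC ltd", "GHF ltd", "THD ltd", "FGG ltd", "JKD ltd"],
})
local_df = pd.DataFrame({
    "Primary key": [234, 234, 344, 344, 566, 566, 677, 4666, 4666],
    "Country": ["USA", "CAN", "USA", "CAN", "USA", "CAN", "USA", "USA", "CAN"],
    "Status": ["Completed", "Pending", "Pending", "Completed", "Pending",
               "Completed", "Pending", "Completed", "Completed"],
    "Date": ["1/8/2018", "3/5/2019", "8/8/2019", "6/5/2018", "3/5/2019",
             "8/8/2019", "8/8/2019", "1/8/2018", "1/8/2018"],
})

# Reshape Local to one row per key; columns become (field, country) pairs.
# Assumption: (Primary key, Country) is unique, otherwise unstack fails.
wide = local_df.set_index(["Primary key", "Country"]).unstack("Country")

# Order the columns country by country, Status before Date,
# with countries in order of first appearance in Local.
countries = list(local_df["Country"].unique())
fields = ["Status", "Date"]
wide = wide[[(field, country) for country in countries for field in fields]]

# Flatten ("Status", "USA") into "USA Status", and so on.
wide.columns = [f"{country} {field}" for field, country in wide.columns]

# Left merge keeps every Global row; missing countries show up as NaN.
result = global_df.merge(wide.reset_index(), on="Primary key", how="left")
print(result)
```

Because the reshape happens once for all countries, adding a region only adds rows to Local, not merge steps to the code.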