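I have 2 pandas DataFrames: df_current_data and df_new_data.
My goal is to apply a merge (not the pandas merge function, but a merge in the 'update / insert' sense). Rows are matched on the key columns.
My result needs to be built from 3 possible row types:
Rows that exist in df_current_data but not in df_new_data are inserted into the result "as is".
Rows that exist in df_new_data but not in df_current_data are inserted into the result "as is".
Rows that exist in both df_new_data and df_current_data: the result takes the row from df_new_data.
This is a classic merge-upsert operation.
Example: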
# rows 0,1 are in current and not in new (check by index1 and index2)
# row 2 is common
In [41]: df_current_source
Out[41]:
   A  index1  index2
0  1       1       4
1  2       2       5
2  3       3       6
# rows 0,2 are in new and not in current (check by index1 and index2)
# row 1 is common
In [42]: df_new_source
Out[42]:
   A  index1  index2
0  4       2       7
1  5       3       6
2  6       4       5
# the result has 2 rows that exist only in current (rows 0,1)
# the result has 2 rows that exist only in new (rows 3,4)
# the result has one row that exists in both current and new (row 2: index1 = 3, index2 = 6), so the value of column A comes from new and not from current (5 instead of 3)
In [43]: df_result
Out[43]:
   A  index1  index2
0  1       1       4
1  2       2       5
2  5       3       6
3  4       2       7
4  6       4       5
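For anyone who wants to reproduce this, the example frames can be rebuilt roughly as follows. This is only a minimal sketch of the data shown above; the name df_expected is a hypothetical label for the desired result and is not used elsewhere in the post.

```python
import pandas as pd

# Current data: rows 0 and 1 exist only here, row 2 (index1=3, index2=6) is shared
df_current_source = pd.DataFrame(
    {"A": [1, 2, 3], "index1": [1, 2, 3], "index2": [4, 5, 6]}
)

# New data: rows 0 and 2 exist only here, row 1 (index1=3, index2=6) is shared
df_new_source = pd.DataFrame(
    {"A": [4, 5, 6], "index1": [2, 3, 4], "index2": [7, 6, 5]}
)

# Desired upsert result (hypothetical name, used only for comparison)
df_expected = pd.DataFrame(
    {"A": [1, 2, 5, 4, 6], "index1": [1, 2, 3, 2, 4], "index2": [4, 5, 6, 7, 5]}
)
```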
This is what I did:
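# left join from current to the key columns of new
# (only the keys are taken from new, so no suffixed _x/_y columns are created)
df = df_current_source.merge(df_new_source[p_new_keys].drop_duplicates(), how='left',
                             left_on=p_curr_keys, right_on=p_new_keys, indicator=True)
# keep only the rows that exist in current and do not exist in new
df_only_current = df.loc[df['_merge'] == 'left_only', df_current_source.columns]
# append the new data to the rows that exist only in current
df_result = pd.concat([df_only_current, df_new_source], ignore_index=True)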
Another option is the isin function:
df_result = pd.concat([df_current_source[~df_current_source[p_key_col_name]\
.isin(df_new_source[p_key_col_name])], df_new_source])
The problem is that if I have more than 1 key column I can't use isin, and I need merge.
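For what it's worth, the isin filter can probably be extended to several key columns by comparing MultiIndex objects built from those columns. The following is a minimal sketch under that assumption; upsert_isin and its parameters are hypothetical names, and the keys are assumed to have the same column names in both frames.

```python
import pandas as pd

def upsert_isin(df_current, df_new, keys):
    """Keep current rows whose key tuple is absent from new, then append new."""
    # Build one MultiIndex per frame from the key columns
    current_keys = pd.MultiIndex.from_frame(df_current[keys])
    new_keys = pd.MultiIndex.from_frame(df_new[keys])
    # Boolean mask: True where the (index1, index2, ...) tuple also exists in new
    in_new = current_keys.isin(new_keys)
    only_current = df_current[~in_new]
    return pd.concat([only_current, df_new], ignore_index=True)

df_current_source = pd.DataFrame(
    {"A": [1, 2, 3], "index1": [1, 2, 3], "index2": [4, 5, 6]}
)
df_new_source = pd.DataFrame(
    {"A": [4, 5, 6], "index1": [2, 3, 4], "index2": [7, 6, 5]}
)

df_result = upsert_isin(df_current_source, df_new_source, ["index1", "index2"])
print(df_result)
```

Only the key columns of current are scanned here, which may matter when current is much larger than new.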
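Assuming current is much larger than new, I guess the best approach is to directly update the matching rows of current with the rows from new, and then append the rows that exist only in the new DataFrame to current.
But I'm not sure how to do that.
Thanks a lot.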
Answer 0 (score: 0)
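Use indicator=True as part of merge:
import numpy as np

df_out = df_current_source.merge(df_new_source,
                                 on=['index1', 'index2'],
                                 how='outer', indicator=True)
# rows present in both frames take A from new (A_y); for rows present in only
# one frame, one of A_x / A_y is NaN, so add(..., fill_value=0) returns the other
df_out['A'] = np.where(df_out['_merge'] == 'both',
                       df_out['A_y'],
                       df_out.A_x.add(df_out.A_y, fill_value=0)).astype(int)
df_out[['A', 'index1', 'index2']]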
Output:
   A  index1  index2
0  1       1       4
1  2       2       5
2  5       3       6
3  4       2       7
4  6       4       5
Or use combine_first with set_index:
df_new_source.set_index(['index1', 'index2'])\
.combine_first(df_current_source.set_index(['index1', 'index2']))\
.reset_index()\
.astype(int)
Output:
   index1  index2  A
0       1       4  1
1       2       5  2
2       2       7  4
3       3       6  5
4       4       5  6
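A third variant that may be worth trying is to concatenate both frames and drop duplicate keys, keeping the last occurrence so that rows from new win. This is only a sketch: upsert_concat is a hypothetical helper name, and it assumes the key combination is unique within each frame.

```python
import pandas as pd

def upsert_concat(df_current, df_new, keys):
    """Stack current on top of new and keep the last row for each key tuple."""
    combined = pd.concat([df_current, df_new], ignore_index=True)
    # new comes last in the concat, so keep='last' prefers its rows on a key clash
    deduped = combined.drop_duplicates(subset=keys, keep="last")
    return deduped.reset_index(drop=True)

df_current_source = pd.DataFrame(
    {"A": [1, 2, 3], "index1": [1, 2, 3], "index2": [4, 5, 6]}
)
df_new_source = pd.DataFrame(
    {"A": [4, 5, 6], "index1": [2, 3, 4], "index2": [7, 6, 5]}
)

df_result = upsert_concat(df_current_source, df_new_source, ["index1", "index2"])
print(df_result)
```

No merge and no NaN filling is involved, so the integer dtypes should survive without an astype call.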
Answer 1 (score: 0)
See this link: join or merge with overwrite in pandas. You can use combine_first:
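# index both frames by the key columns so that rows are matched on the keys, not on A
combined_dataframe = df_new_source.set_index(['index1', 'index2']).combine_first(df_current_source.set_index(['index1', 'index2']))
combined_dataframe.reset_index()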
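The in-place idea raised in the question (update the matching rows of current, then append the rows that exist only in new) could be sketched with DataFrame.update on frames indexed by the key columns. Treat this as an untuned sketch: upsert_update is a hypothetical helper name, the key combination is assumed to be unique within each frame, and NaN values in new are assumed not to be meant as overwrites, since update skips them.

```python
import pandas as pd

def upsert_update(df_current, df_new, keys):
    """Overwrite matching rows of current with new, then append new-only rows."""
    cur = df_current.set_index(keys)
    new = df_new.set_index(keys)
    # update aligns on the index and overwrites cur with non-NA values from new
    cur.update(new)
    # rows of new whose key tuple does not exist in current
    only_new = new[~new.index.isin(cur.index)]
    result = pd.concat([cur, only_new]).reset_index()
    # restore the original column order and dtypes, in case update
    # upcast integer columns to float
    result = result[list(df_current.columns)]
    return result.astype(df_current.dtypes.to_dict())

df_current_source = pd.DataFrame(
    {"A": [1, 2, 3], "index1": [1, 2, 3], "index2": [4, 5, 6]}
)
df_new_source = pd.DataFrame(
    {"A": [4, 5, 6], "index1": [2, 3, 4], "index2": [7, 6, 5]}
)

df_result = upsert_update(df_current_source, df_new_source, ["index1", "index2"])
print(df_result)
```

Whether this is actually faster than the merge or concat variants on a large current frame would need to be measured.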