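I have been trying to join tables using a combination of three variables as the unique join key (application_number, application_dt, and account_id).
account_id can be null, but the join can still work by comparing the other two keys, so I coalesce account_id to replace nulls with 0.
This works on my tables, but I have run into another odd problem.
When I join a DataFrame with itself, the join appears to ignore account_id, so the result contains duplicate rows.
I cannot figure out why this does not work.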
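To make the intended matching rule concrete, here is a minimal plain-Python sketch (not Spark code) of a left self-join on the composite key, with a missing account_id replaced by 0. The helper names `join_key` and `left_self_join` are hypothetical, and the sketch assumes 0 is never a real account_id.

```python
# Plain-Python sketch of the intended matching rule (not Spark code).
# Each row is (application_number, application_dt, account_id, Var1).
rows = [
    (1, "2018-07-31", 215, "a"),
    (2, "2018-07-30", None, "b"),
    (1, "2018-07-31", 123, "c"),
]


def join_key(row):
    """Composite key, with a missing account_id replaced by 0 (assumed unused as a real id)."""
    application_number, application_dt, account_id, _ = row
    return (application_number, application_dt, 0 if account_id is None else account_id)


def left_self_join(rows):
    """Pair each row with every row that shares the same composite key."""
    by_key = {}
    for row in rows:
        by_key.setdefault(join_key(row), []).append(row)
    return [(left, right) for left in rows for right in by_key[join_key(left)]]


pairs = left_self_join(rows)
for left, right in pairs:
    print(left, right)
```

With this rule, each of the three sample rows should pair only with itself, which is the result I expect from the Spark join.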
from pyspark.sql import functions as F

# sqlCtx is assumed to be an already-created SQLContext
data1 = [[1, '2018-07-31', 215, 'a'],
         [2, '2018-07-30', None, 'b'],
         [1, '2018-07-31', 123, 'c']]
df_1 = sqlCtx.createDataFrame(data1,
    ['application_number', 'application_dt', 'account_id', 'Var1'])
df_1 = df_1.withColumn('new_var', F.coalesce(df_1.account_id, F.lit(0)))
df_2 = df_1  # refers to the same DataFrame object, not a copy
# Join conditions written as one string, split on "|" and evaluated into Column expressions
join_elem = ("df_1.application_number == df_2.application_number|"
             "df_1.application_dt == df_2.application_dt|"
             "F.coalesce(df_1.account_id,F.lit(0)) == F.coalesce(df_2.account_id,F.lit(0))").split("|")
join_elem_column = [eval(x) for x in join_elem]
new = df_1.join(df_2, join_elem_column, 'left')
The joined DataFrame looks like this:
+------------------+--------------+----------+----+-------+------------------+--------------+----------+----+-------+
|application_number|application_dt|account_id|Var1|new_var|application_number|application_dt|account_id|Var1|new_var|
+------------------+--------------+----------+----+-------+------------------+--------------+----------+----+-------+
|                 1|    2018-07-31|       215|   a|    215|                 1|    2018-07-31|       215|   a|    215|
|                 1|    2018-07-31|       215|   a|    215|                 1|    2018-07-31|       123|   c|    123|
|                 1|    2018-07-31|       123|   c|    123|                 1|    2018-07-31|       215|   a|    215|
|                 1|    2018-07-31|       123|   c|    123|                 1|    2018-07-31|       123|   c|    123|
|                 2|    2018-07-30|      null|   b|      0|                 2|    2018-07-30|      null|   b|      0|
+------------------+--------------+----------+----+-------+------------------+--------------+----------+----+-------+
By contrast, if I instead join on the precomputed new_var column:
join_elem = "df_1.application_number == df_2.application_number|df_1.application_dt == df_2.application_dt|df_1.new_var == df_2.new_var".split("|")
join_elem_column = [eval(x) for x in join_elem]
new = df_1.join(df_2,join_elem_column,'left')
The result is as follows:
+------------------+--------------+----------+----+-------+------------------+--------------+----------+----+-------+
|application_number|application_dt|account_id|Var1|new_var|application_number|application_dt|account_id|Var1|new_var|
+------------------+--------------+----------+----+-------+------------------+--------------+----------+----+-------+
|                 1|    2018-07-31|       215|   a|    215|                 1|    2018-07-31|       215|   a|    215|
|                 1|    2018-07-31|       123|   c|    123|                 1|    2018-07-31|       123|   c|    123|
|                 2|    2018-07-30|      null|   b|      0|                 2|    2018-07-30|      null|   b|      0|
+------------------+--------------+----------+----+-------+------------------+--------------+----------+----+-------+