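Multiple JOINs on Spark DataFrames

Time: 2019-08-05 16:03:18

Tags: scala apache-spark hadoop

I want to add two columns, LONG_IND and SHORT_IND, to the result of a JOIN between two DataFrames. Rows should be matched using the following comparisons: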

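a. The prefix (first 3 characters) of ACCOUNT_NO in DataFrame 1 lies between Prefix_FROM and Prefix_TO in DataFrame 2 (numeric comparison)
b. The suffix (last 5 characters) of ACCOUNT_NO in DataFrame 1 lies between Suffix_FROM and Suffix_TO in DataFrame 2 (numeric comparison)
c. The prefix (first 3 characters) of ACCOUNT_NO in DataFrame 1 lies between Prefix_FROM and Prefix_TO in DataFrame 2 (alphanumeric comparison)
d. The suffix (last 5 characters) of ACCOUNT_NO in DataFrame 1 lies between Suffix_FROM and Suffix_TO in DataFrame 2 (alphanumeric comparison)
e. The suffix (last 5 characters) of ACCOUNT_NO in DataFrame 1 may be anything (ALL in the data) (alphanumeric comparison)
f. ACCOUNT_NO in DataFrame 1 may be anything (ALL in the data) (default case)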
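
To make the rules above concrete, here is a minimal plain-Scala sketch of the matching logic, independent of Spark. The names Rule, RuleMatch, inRange and matches are hypothetical and not from the post. It assumes that "ALL" in a bound means "match anything", that a comparison is numeric only when the value and both bounds consist solely of digits, and that any other comparison is lexicographic.

```scala
// Hypothetical names throughout; none of these come from the original post.
final case class Rule(pfxFrom: String, pfxTo: String, sfxFrom: String, sfxTo: String)

object RuleMatch {
  // True when the string is non-empty and contains only ASCII digits.
  private def isDigits(s: String): Boolean =
    s.nonEmpty && s.forall(c => c >= '0' && c <= '9')

  // Assumption: "ALL" in either bound is a wildcard that matches any value.
  def inRange(value: String, from: String, to: String): Boolean =
    if (from == "ALL" || to == "ALL") true
    else if (isDigits(value) && isDigits(from) && isDigits(to))
      BigInt(value) >= BigInt(from) && BigInt(value) <= BigInt(to) // numeric comparison
    else
      value >= from && value <= to // alphanumeric (lexicographic) comparison

  // Prefix = first 3 characters, suffix = last 5 characters of the account number.
  def matches(accountNo: String, r: Rule): Boolean =
    inRange(accountNo.take(3), r.pfxFrom, r.pfxTo) &&
      inRange(accountNo.takeRight(5), r.sfxFrom, r.sfxTo)
}
```

With the sample data from the question, account 3S534789 would match the rule 3S4,3S6,ALL,ALL under these assumptions, and every account would match the default rule ALL,ALL,ALL,ALL.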

How can I add the alphanumeric comparisons and the default case to the same DataFrame? If I write a separate JOIN again, the LONG_IND and SHORT_IND columns

DataFrame 1

ACCOUNT_NO,CostCenter,BU,MPU
0000001F,,BOXXBU          ,BOXXMP          
0000002Q,,BOXXBU          ,BOXXMP          
92115301,,BOXXBU          ,BOXXMP
32934789,,BOXXBU          ,BOXXMP
3FA34789,,BOXXBU          ,BOXXMP
3S534789,,BOXXBU          ,BOXXMP

DataFrame 2

ACCT PFX FROM,ACCT PFX TO,ACCT SFX FROM,ACCT SFX TO,TIER 1 LONG,TIER 2 LONG,TIER 1 SHORT,TIER 2 SHORT
329,329,89276,89276,15,10,65,10
3FA,3FA,00001,00001,1,1,90,1
ALL,ALL,ALL,ALL,8,99,88,99
934,999,ALL,ALL,8,85,88,85
3S4,3S6,ALL,ALL,6,22,65,22

So far I have written the code below, in two alternative forms. It works for the numeric cases (a and b):

val getRuleDF = accDF.join(
    customerRulesDF,
    accDF("ACCOUNT_NO").substr(0, 3).between(customerRulesDF("ACCT_PFX_FROM"), customerRulesDF("ACCT_PFX_TO")) &&
      accDF("ACCOUNT_NO").substr(4, 5).between(customerRulesDF("ACCT_SFX_FROM"), customerRulesDF("ACCT_SFX_TO")),
    "inner")
  .withColumn("LONG_IND", concatColumns(customerRulesDF("TIER_1_LONG"), customerRulesDF("TIER_2_LONG")))
  .withColumn("SHORT_IND", concatColumns(customerRulesDF("TIER_1_SHORT"), customerRulesDF("TIER_2_SHORT")))

OR

val getRule1DF = getRuleDF.join(customerRulesDF,
  (accDF("ACCOUNT_NO").substr(0, 3) >= customerRulesDF("ACCT_PFX_FROM")) &&
  (accDF("ACCOUNT_NO").substr(0, 3) <= customerRulesDF("ACCT_PFX_TO")) &&
  (accDF("ACCOUNT_NO").substr(4, 5) >= customerRulesDF("ACCT_SFX_FROM")) &&
  (accDF("ACCOUNT_NO").substr(4, 5) <=  customerRulesDF("ACCT_SFX_TO")), "inner")
  .withColumn("LONG_IND", concatColumns(customerRulesDF("TIER_1_LONG"), customerRulesDF("TIER_2_LONG")) )
  .withColumn("SHORT_IND", concatColumns(customerRulesDF("TIER_1_SHORT"), customerRulesDF("TIER_2_SHORT")) )
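
One possible way to cover the alphanumeric and ALL cases in a single JOIN is to fold the wildcard check into the join condition. The sketch below is only an illustration: RuleJoin, inRangeOrAll and joinWithRules are hypothetical names, and it assumes that the rule columns are strings (which is likely when they contain the value ALL) and that prefixes and suffixes have a fixed width, so that a plain string comparison orders zero-padded digit strings the same way a numeric comparison would.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.lit

// Hypothetical helper object; not part of the original post.
object RuleJoin {
  // True when both bounds are the "ALL" wildcard, or when value lies in [from, to].
  // On string columns, between compares lexicographically.
  def inRangeOrAll(value: Column, from: Column, to: Column): Column =
    (from === lit("ALL") && to === lit("ALL")) || value.between(from, to)

  def joinWithRules(accDF: DataFrame, customerRulesDF: DataFrame): DataFrame = {
    // Column.substr takes a 1-based start position and a length.
    val prefix = accDF("ACCOUNT_NO").substr(1, 3)
    val suffix = accDF("ACCOUNT_NO").substr(4, 5)
    val cond =
      inRangeOrAll(prefix, customerRulesDF("ACCT_PFX_FROM"), customerRulesDF("ACCT_PFX_TO")) &&
        inRangeOrAll(suffix, customerRulesDF("ACCT_SFX_FROM"), customerRulesDF("ACCT_SFX_TO"))
    accDF.join(customerRulesDF, cond, "inner")
  }
}
```

The two withColumn calls for LONG_IND and SHORT_IND from the question could then be applied once to the result of joinWithRules. Note that under this condition an account that matches a specific rule also matches the default rule ALL,ALL,ALL,ALL, so it would appear in more than one output row; choosing the most specific rule per account would need an extra step that is not shown here.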

0 answers:

No answers