Spark DataFrame multiple JOINs

Time: 2019-08-05 16:03:18

Tags: scala apache-spark hadoop

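I want to add two columns, LONG_IND and SHORT_IND, to the JOIN between 2 DataFrames by comparing:

  1. The 3-character prefix of the ACCOUNT_NO column in DataFrame 1 lies between the Prefix_FROM and Prefix_TO columns in DataFrame 2 (numeric comparison)
  2. The 5-character suffix of the ACCOUNT_NO column in DataFrame 1 lies between the Suffix_FROM and Suffix_TO columns in DataFrame 2 (numeric comparison)
  3. The 3-character prefix of the ACCOUNT_NO column in DataFrame 1 lies between the Prefix_FROM and Prefix_TO columns in DataFrame 2 (alphanumeric comparison)
  4. The 5-character suffix of the ACCOUNT_NO column in DataFrame 1 lies between the Suffix_FROM and Suffix_TO columns in DataFrame 2 (alphanumeric comparison)
  5. The 5-character suffix of the ACCOUNT_NO column in DataFrame 1 may be any value (ALL in the data) (alphanumeric comparison)
  6. The ACCOUNT_NO column in DataFrame 1 may be anything (ALL in the data) (default case)

    How can I add the alphanumeric comparison and the default case in the same DataFrame? If I write a separate JOIN again, the LONG_IND and SHORT_IND columns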
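
The six cases above can be sketched in plain Scala, outside Spark, to make the matching rule explicit. This is a minimal sketch under two assumptions that the question does not confirm: when several rules match, the one with the fewest ALL parts should win, and prefixes and suffixes are always fixed-width (3 and 5 characters). Under the fixed-width assumption, a plain string comparison gives the same order as a numeric comparison for all-digit values, so one comparison covers both the numeric and the alphanumeric cases. The names `Rule`, `RuleMatch`, `inRange`, `specificity` and `bestRule` are hypothetical.

```scala
// Hypothetical model of one row of DataFrame 2; field names are illustrative only.
final case class Rule(pfxFrom: String, pfxTo: String, sfxFrom: String, sfxTo: String,
                      tier1Long: String, tier2Long: String,
                      tier1Short: String, tier2Short: String)

object RuleMatch {
  val Wildcard = "ALL"

  // The five rule rows shown in DataFrame 2 of the question.
  val sampleRules: Seq[Rule] = Seq(
    Rule("329", "329", "89276", "89276", "15", "10", "65", "10"),
    Rule("3FA", "3FA", "00001", "00001", "1", "1", "90", "1"),
    Rule("ALL", "ALL", "ALL", "ALL", "8", "99", "88", "99"),
    Rule("934", "999", "ALL", "ALL", "8", "85", "88", "85"),
    Rule("3S4", "3S6", "ALL", "ALL", "6", "22", "65", "22"))

  // True when both bounds are the wildcard, or when value lies in [from, to].
  // Fixed-width strings are compared lexicographically.
  def inRange(value: String, from: String, to: String): Boolean =
    (from == Wildcard && to == Wildcard) ||
      (value.compareTo(from) >= 0 && value.compareTo(to) <= 0)

  // Number of non-wildcard parts; a higher value means a more specific rule (assumed precedence).
  def specificity(r: Rule): Int =
    (if (r.pfxFrom == Wildcard) 0 else 1) + (if (r.sfxFrom == Wildcard) 0 else 1)

  // Most specific matching rule for an account number, if any rule matches.
  def bestRule(accountNo: String, rules: Seq[Rule]): Option[Rule] = {
    val prefix = accountNo.take(3)
    val suffix = accountNo.takeRight(5)
    val matches = rules.filter { r =>
      inRange(prefix, r.pfxFrom, r.pfxTo) && inRange(suffix, r.sfxFrom, r.sfxTo)
    }
    if (matches.isEmpty) None else Some(matches.maxBy(specificity))
  }
}
```

With the sample rows, `RuleMatch.bestRule("3S534789", RuleMatch.sampleRules)` selects the `3S4,3S6,ALL,ALL` rule, while an account such as `92115301` only matches the `ALL,ALL,ALL,ALL` default row.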

DataFrame 1


ACCOUNT_NO,CostCenter,BU,MPU
0000001F,,BOXXBU          ,BOXXMP          
0000002Q,,BOXXBU          ,BOXXMP          
92115301,,BOXXBU          ,BOXXMP
32934789,,BOXXBU          ,BOXXMP
3FA34789,,BOXXBU          ,BOXXMP
3S534789,,BOXXBU          ,BOXXMP

DataFrame 2


ACCT PFX FROM,ACCT PFX TO,ACCT SFX FROM,ACCT SFX TO,TIER 1 LONG,TIER 2 LONG,TIER 1 SHORT,TIER 2 SHORT
329,329,89276,89276,15,10,65,10
3FA,3FA,00001,00001,1,1,90,1
ALL,ALL,ALL,ALL,8,99,88,99
934,999,ALL,ALL,8,85,88,85
3S4,3S6,ALL,ALL,6,22,65,22

So far I have written the code below in two variants; it works for the numeric cases (a and b) shown here:

val getRuleDF = accDF.join(customerRulesDF,
  accDF("ACCOUNT_NO").substr(0, 3).between(customerRulesDF("ACCT_PFX_FROM"), customerRulesDF("ACCT_PFX_TO")) &&
  accDF("ACCOUNT_NO").substr(4, 5).between(customerRulesDF("ACCT_SFX_FROM"), customerRulesDF("ACCT_SFX_TO")), "inner")
  .withColumn("LONG_IND", concatColumns(customerRulesDF("TIER_1_LONG"), customerRulesDF("TIER_2_LONG")) )
  .withColumn("SHORT_IND", concatColumns(customerRulesDF("TIER_1_SHORT"), customerRulesDF("TIER_2_SHORT")) )

OR

val getRule1DF = getRuleDF.join(customerRulesDF,
  (accDF("ACCOUNT_NO").substr(0, 3) >= customerRulesDF("ACCT_PFX_FROM")) &&
  (accDF("ACCOUNT_NO").substr(0, 3) <= customerRulesDF("ACCT_PFX_TO")) &&
  (accDF("ACCOUNT_NO").substr(4, 5) >= customerRulesDF("ACCT_SFX_FROM")) &&
  (accDF("ACCOUNT_NO").substr(4, 5) <=  customerRulesDF("ACCT_SFX_TO")), "inner")
  .withColumn("LONG_IND", concatColumns(customerRulesDF("TIER_1_LONG"), customerRulesDF("TIER_2_LONG")) )
  .withColumn("SHORT_IND", concatColumns(customerRulesDF("TIER_1_SHORT"), customerRulesDF("TIER_2_SHORT")) )
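
One possible way to cover the alphanumeric ranges and the ALL default in a single join is to treat ALL as a wildcard inside the join condition and then keep only the most specific matching rule per account. The sketch below assumes that the rule columns are read as strings, that ACCOUNT_NO is unique per row in DataFrame 1, and that a rule with fewer ALL parts should take precedence; none of this is confirmed by the question. The built-in `concat` is only a stand-in for the `concatColumns` helper, whose definition is not shown, and `RuleJoinSketch`, `inRange`, `joinWithDefault` and `rule_rank` are hypothetical names.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, concat, row_number, when}

object RuleJoinSketch {
  // Matches when both bounds are the ALL wildcard or the value lies in [from, to].
  // With string columns, between() compares lexicographically.
  def inRange(value: Column, from: Column, to: Column): Column =
    (from === "ALL" && to === "ALL") || value.between(from, to)

  def joinWithDefault(accDF: DataFrame, rulesDF: DataFrame): DataFrame = {
    val prefix = accDF("ACCOUNT_NO").substr(1, 3) // first 3 characters
    val suffix = accDF("ACCOUNT_NO").substr(4, 5) // characters 4 to 8

    val cond =
      inRange(prefix, rulesDF("ACCT_PFX_FROM"), rulesDF("ACCT_PFX_TO")) &&
      inRange(suffix, rulesDF("ACCT_SFX_FROM"), rulesDF("ACCT_SFX_TO"))

    // Assumed precedence: rules with fewer ALL parts rank first.
    val specificity =
      when(rulesDF("ACCT_PFX_FROM") === "ALL", 0).otherwise(1) +
      when(rulesDF("ACCT_SFX_FROM") === "ALL", 0).otherwise(1)
    val byAccount = Window.partitionBy(accDF("ACCOUNT_NO")).orderBy(specificity.desc)

    accDF.join(rulesDF, cond, "inner")
      .withColumn("rule_rank", row_number().over(byAccount))
      .filter(col("rule_rank") === 1) // keep one rule per account
      .drop("rule_rank")
      .withColumn("LONG_IND",
        concat(rulesDF("TIER_1_LONG").cast("string"), rulesDF("TIER_2_LONG").cast("string")))
      .withColumn("SHORT_IND",
        concat(rulesDF("TIER_1_SHORT").cast("string"), rulesDF("TIER_2_SHORT").cast("string")))
  }
}
```

Because every account matches the ALL,ALL,ALL,ALL row, the ranking step is what keeps the default from producing a second row for accounts that also match a narrower rule.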

0 answers:

No answers