Triggering a regex when joining DataFrames

Time: 2018-12-21 06:21:54

Tags: regex scala apache-spark

I need to write a regex as a condition check while doing some joins.

My regex should match the following strings:

n3_testindia1 = test-india-1
n2_stagamerica2 = stag-america-2
n1_prodeurope2 = prod-europe-2

df1.select("location1").distinct.show()

+----------------+
|    location1   |
+----------------+
|n3_testindia1   |
|n2_stagamerica2 |
|n1_prodeurope2  |
+----------------+

df2.select("loc1").distinct.show()

+--------------+
|      loc1    |
+--------------+
|test-india-1  |   
|stag-america-2|
|prod-europe-2 |
+--------------+

I want to join based on the location columns, something like:

val joindf = df1.join(df2, df1("location1") == regex(df2("loc1")))
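For reference, a minimal sketch that rebuilds the two example DataFrames (assuming an existing SparkSession named spark) could be:

// assumes an existing SparkSession named `spark`
import spark.implicits._

val df1 = Seq("n3_testindia1", "n2_stagamerica2", "n1_prodeurope2").toDF("location1")
val df2 = Seq("test-india-1", "stag-america-2", "prod-europe-2").toDF("loc1")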

2 answers:

Answer 0: (score: 0)

Based on the information above, in Spark 2.4.0 you can use

import org.apache.spark.sql.functions.{regexp_extract, translate}

val joindf = df1.join(df2, 
  regexp_extract(df1("location1"), """[^_]+_(.*)""", 1) 
    === translate(df2("loc1"), "-", ""))

or, in earlier versions, for example

import org.apache.spark.sql.functions.{length, lit, translate}

val joindf = df1.join(df2, 
  df1("location1").substr(lit(4), length(df1("location1")))
    === translate(df2("loc1"), "-", ""))

Answer 1: (score: 0)

You can split location1 on "_" into 2 elements and then match the second element against the whole loc1 string with the "-" characters removed, as in the sketch below.
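A minimal Scala sketch of that approach, using the standard split and translate functions from org.apache.spark.sql.functions (not the answerer's original snippet), might look like this:

import org.apache.spark.sql.functions.{split, translate}

// split location1 on "_" and take the second element (e.g. "testindia1"),
// then compare it with loc1 stripped of hyphens ("test-india-1" -> "testindia1")
val joindf = df1.join(df2,
  split(df1("location1"), "_").getItem(1) === translate(df2("loc1"), "-", ""))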
