I need to write some regular expressions for a status check when performing some joins.
My regex should match strings like
n3_testindia1 = test-india-1
n2_stagamerica2 = stag-america-2
n1_prodeurope2 = prod-europe-2
df1.select("location1").distinct.show()
+----------------+
| location1 |
+----------------+
|n3_testindia1 |
|n2_stagamerica2 |
|n1_prodeurope2  |
+----------------+
df2.select("loc1").distinct.show()
+--------------+
| loc1 |
+--------------+
|test-india-1 |
|stag-america-2|
|prod-europe-2 |
+--------------+
I want to join on the location columns above, with something like
val joindf = df1.join(df2, df1("location1") == regex(df2("loc1")))
Answer 0 (score: 0)
Based on the information above, in Spark 2.4.0 you can use
import org.apache.spark.sql.functions.{regexp_extract, translate}

val joindf = df1.join(df2,
  regexp_extract(df1("location1"), """[^_]+_(.*)""", 1)
    === translate(df2("loc1"), "-", ""))
or, in earlier versions, something like
import org.apache.spark.sql.functions.{lit, length, translate}

val joindf = df1.join(df2,
  df1("location1").substr(lit(4), length(df1("location1")))
    === translate(df2("loc1"), "-", ""))
Answer 1 (score: 0)
You can split location1 on "_" into 2 elements, then match the second element against loc1 with the "-" characters removed. Check it out:
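A minimal sketch of that split-and-translate approach, assuming the `df1`/`df2` DataFrames shown above (the index passed to `getItem` assumes exactly one "_" separates the prefix from the rest):

```scala
import org.apache.spark.sql.functions.{split, translate}

// Split "n3_testindia1" on "_" and take the element after the prefix ("testindia1"),
// then compare it with "test-india-1" after stripping the dashes.
val joindf = df1.join(df2,
  split(df1("location1"), "_").getItem(1)
    === translate(df2("loc1"), "-", ""))
```

This avoids hard-coding the prefix length, so it also works if the prefix grows beyond two characters, as long as the first "_" still marks the boundary.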