Triggering a regex when joining DataFrames

Time: 2018-12-21 06:21:54

Tags: regex scala apache-spark

I need to write a regex as a condition check while doing some joins.

My regex should match the following strings:

n3_testindia1 = test-india-1
n2_stagamerica2 = stag-america-2
n1_prodeurope2 = prod-europe-2

df1.select("location1").distinct.show()

+----------------+
|    location1   |
+----------------+
|n3_testindia1   |
|n2_stagamerica2 |
|n1_prodeurope2  |
+----------------+

df2.select("loc1").distinct.show()

+--------------+
|      loc1    |
+--------------+
|test-india-1  |   
|stag-america-2|
|prod-europe-2 |
+--------------+

I want to join based on the location columns, something like:

val joindf = df1.join(df2, df1("location1") == regex(df2("loc1")))
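For reference, a minimal sketch that rebuilds the two example DataFrames (assuming an existing SparkSession named spark) could be:

// assumes an existing SparkSession named `spark`
import spark.implicits._

val df1 = Seq("n3_testindia1", "n2_stagamerica2", "n1_prodeurope2").toDF("location1")
val df2 = Seq("test-india-1", "stag-america-2", "prod-europe-2").toDF("loc1")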

2 answers:

Answer 0: (score: 0)

Based on the information above, in Spark 2.4.0 you can use

import org.apache.spark.sql.functions.{regexp_extract, translate}

val joindf = df1.join(df2, 
  regexp_extract(df1("location1"), """[^_]+_(.*)""", 1) 
    === translate(df2("loc1"), "-", ""))

or, in earlier versions, for example

import org.apache.spark.sql.functions.{length, lit, translate}

val joindf = df1.join(df2, 
  df1("location1").substr(lit(4), length(df1("location1")))
    === translate(df2("loc1"), "-", ""))

Answer 1: (score: 0)

You can split location1 on "_" into 2 elements and then match the second element against the whole loc1 string with the "-" characters removed, as in the sketch below.
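A minimal Scala sketch of that approach, using the standard split and translate functions from org.apache.spark.sql.functions (not the answerer's original snippet), might look like this:

import org.apache.spark.sql.functions.{split, translate}

// split location1 on "_" and take the second element (e.g. "testindia1"),
// then compare it with loc1 stripped of hyphens ("test-india-1" -> "testindia1")
val joindf = df1.join(df2,
  split(df1("location1"), "_").getItem(1) === translate(df2("loc1"), "-", ""))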
