Joining two DataFrames in Spark using a regular expression

Time: 2020-09-23 18:28:14

Tags: regex scala apache-spark

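Say I have a DataFrame df1 with a column "color" that contains a bunch of colors, and another DataFrame df2 with a column "phrase" that contains various phrases.

I'd like to join the two DataFrames where the color in d1 appears in a phrase in d2. I can't use d1.join(d2, d2("phrases").contains(d1("color"))), because that joins whenever the color appears anywhere in the phrase. I don't want to match words like scaRED, where RED is part of another word. I only want to join when the color appears as a separate word in the phrase.

Can I use a regular expression to solve this? Which function can I use when the expression needs to reference a column, and what is the syntax?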

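3 answers:

Answer 0: (score: 2)

I haven't seen your data, but here is a starting point with a slight variation. As far as I can see no regex is needed, but who knows: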

// You still need some parsing, such as stripping "." and "?", and possibly normalizing case.
// You did not provide an example of the JOIN.

import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF and the $"..." column syntax

// Seq[String] accepts the array column regardless of the concrete Scala collection Spark passes in.
// The comparison is case-insensitive.
val checkValue = udf { (array: Seq[String], value: String) => array.exists(_.toLowerCase == value.toLowerCase) }

// Generate some data
val dfCompare = Seq("red", "blue", "gold", "cherry").toDF("color")
val rdd = spark.sparkContext.parallelize(Seq(("red", "hello how are you red", 10), ("blue", "I am fine but blue", 20), ("cherry", "you need to do some parsing and I like cherry", 30), ("thebluephantom", "you need to do some parsing and I like fanta", 30)))
//rdd.collect
val df = rdd.toDF()
// Split each phrase into an array of words on single spaces
val df2 = df.withColumn("_4", split($"_2", " "))
df2.show(false)
dfCompare.show(false)
// Keep only the pairs where the color is one of the words in the phrase
val res = df2.join(dfCompare, checkValue(df2("_4"), dfCompare("color")), "inner")
res.show(false)

Returns:

+------+---------------------------------------------+---+--------------------------------------------------------+------+
|_1    |_2                                           |_3 |_4                                                      |color |
+------+---------------------------------------------+---+--------------------------------------------------------+------+
|red   |hello how are you red                        |10 |[hello, how, are, you, red]                             |red   |
|blue  |I am fine but blue                           |20 |[I, am, fine, but, blue]                                |blue  |
|cherry|you need to do some parsing and I like cherry|30 |[you, need, to, do, some, parsing, and, I, like, cherry]|cherry|
+------+---------------------------------------------+---+--------------------------------------------------------+------+
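The comments in the code above mention that some parsing (stripping punctuation, normalizing case) is still needed before the words are compared. A minimal sketch of that step in plain Scala follows. The object `PhraseTokens` and its methods `tokenize` and `containsWord` are hypothetical names introduced here for illustration, not part of the original answer or of Spark.

```scala
object PhraseTokens {
  // Hypothetical helper: lowercases a phrase and splits it on runs of
  // non-word characters, so "Blue?" and "blue" produce the same token.
  def tokenize(phrase: String): Seq[String] =
    phrase.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // True when `color` is one of the phrase's tokens, ignoring case.
  def containsWord(phrase: String, color: String): Boolean =
    tokenize(phrase).contains(color.toLowerCase)
}
```

Under these assumptions, `PhraseTokens.containsWord("hello how are you red.", "RED")` is true while `PhraseTokens.containsWord("scared cat", "red")` is false. The same logic could presumably be wrapped in a `udf` and used in place of `checkValue`, taking the raw phrase column instead of the pre-split array.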

Answer 1: (score: 2)

You could create a regex pattern that checks for word boundaries (\b) when matching the colors, and use a regexp_replace check as the join condition:

import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF

val df1 = Seq(
  (1, "red"), (2, "green"), (3, "blue")
).toDF("id", "color")

val df2 = Seq(
  "red apple", "scared cat", "blue sky", "green hornet"
).toDF("phrase")

val patternCol = concat(lit("\\b"), df1("color"), lit("\\b"))

df1.join(df2, regexp_replace(df2("phrase"), patternCol, lit("")) =!= df2("phrase")).
  show
// +---+-----+------------+
// | id|color|      phrase|
// +---+-----+------------+
// |  1|  red|   red apple|
// |  3| blue|    blue sky|
// |  2|green|green hornet|
// +---+-----+------------+

Note that "scared cat" would have been a match without the enclosing word boundaries.
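To see the word-boundary behaviour in isolation, here is a small sketch in plain Scala using `java.util.regex`, the engine that Spark's regexp functions rely on. The object `WordBoundary` and its methods are hypothetical names for illustration. The `Pattern.quote` call is an addition that the answer above does not make: it is only an assumption that color values might contain regex metacharacters such as `.` or `+`.

```scala
import java.util.regex.Pattern

object WordBoundary {
  // Hypothetical helper: builds \b<color>\b, quoting the color so that
  // any regex metacharacters in the data are matched literally.
  def wholeWord(color: String): Pattern =
    Pattern.compile("\\b" + Pattern.quote(color) + "\\b")

  // True when `color` occurs in `phrase` as a separate word (case-sensitive).
  def matches(phrase: String, color: String): Boolean =
    wholeWord(color).matcher(phrase).find()
}
```

With this sketch, `WordBoundary.matches("red apple", "red")` is true and `WordBoundary.matches("scared cat", "red")` is false, which mirrors the join result shown above. The match is case-sensitive, as in the answer's pattern.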

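Answer 2: (score: 2)

Building on your own solution, you can also try the following, which splits each phrase on spaces and checks whether the color is one of the resulting words: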

d1.join(d2, array_contains(split(d2("phrases"), " "), d1("color")))