I have a dataframe that looks like this:
stringTokenDF
+---+------------+
| Id|      Tokens|
+---+------------+
|  1|[A, B, C, D]|
|  1|[B, C, D, G]|
|  1|   [A, D, E]|
|  1|   [B, C, F]|
|  2|   [A, C, D]|
|  2|   [C, E, F]|
|  2|[A, C, D, H]|
+---+------------+
And another dataframe that looks like this:
leastFrequenctDf
+---+------------------+
| Id|LeastFrequentWords|
+---+------------------+
|  1|            [E, G]|
|  2|         [E, F, H]|
+---+------------------+
Now I want to remove from the Tokens sequence in stringTokenDF every string that appears in the LeastFrequentWords sequence for the same Id. My output should look like this:
+---+------------+
| Id|      Tokens|
+---+------------+
|  1|[A, B, C, D]|
|  1|   [B, C, D]|
|  1|      [A, D]|
|  1|   [B, C, F]|
|  2|   [A, C, D]|
|  2|         [C]|
|  2|   [A, C, D]|
+---+------------+
I tried joining the two dataframes and intersecting the sequences, but it did not give me the correct result:
val intersectorUDF = udf((seq1: Seq[String], seq2: Seq[String]) =>
  seq1.intersect(seq2)   // keeps only the elements common to both sequences
)

stringTokenDF.join(leastFrequenctDf, stringTokenDF("id") === leastFrequenctDf("id")).
  withColumn("intersectedToken",
    intersectorUDF(stringTokenDF("Tokens"), leastFrequenctDf("LeastFrequentWords")))
What is the right way to achieve this in Spark Scala?
Answer 0 (score: 2)
You can join the two DataFrames and apply a UDF that computes the diff between the two sequence columns:
import org.apache.spark.sql.functions.udf
import spark.implicits._

val stringTokenDF = Seq(
  (1, Seq("A", "B", "C", "D")),
  (1, Seq("B", "C", "D", "G")),
  (1, Seq("A", "D", "E")),
  (1, Seq("B", "C", "F")),
  (2, Seq("A", "C", "D")),
  (2, Seq("C", "E", "F")),
  (2, Seq("A", "C", "D", "H"))
).toDF("Id", "Tokens")

val leastFrequenctDf = Seq(
  (1, Seq("E", "G")),
  (2, Seq("E", "F", "H"))
).toDF("Id", "LeastFrequentWords")

// UDF that removes from s1 every element that also appears in s2
def diff = udf( (s1: Seq[String], s2: Seq[String]) =>
  s1 diff s2
)

stringTokenDF.join(leastFrequenctDf, Seq("Id")).
  select($"Id", diff($"Tokens", $"LeastFrequentWords").as("Tokens")).
  show
// +---+------------+
// | Id| Tokens|
// +---+------------+
// | 1|[A, B, C, D]|
// | 1| [B, C, D]|
// | 1| [A, D]|
// | 1| [B, C, F]|
// | 2| [A, C, D]|
// | 2| [C]|
// | 2| [A, C, D]|
// +---+------------+
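
If you are on Spark 2.4 or later, the same result can be obtained without a UDF by using the built-in array_except function, which returns the elements of the first array that do not appear in the second (unlike Seq.diff it also drops duplicates within the first array, which makes no difference for this data). A minimal sketch of that variant:

import org.apache.spark.sql.functions.{array_except, col}

// Spark 2.4+: array_except(a, b) keeps the elements of a that are not present in b
stringTokenDF.join(leastFrequenctDf, Seq("Id"))
  .select(col("Id"), array_except(col("Tokens"), col("LeastFrequentWords")).as("Tokens"))
  .show()

Avoiding the UDF lets Catalyst optimize the expression and skips the serialization overhead of calling into Scala code row by row.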