Remove Strings from a Sequence in one DataFrame based on another DataFrame

Time: 2018-04-12 03:12:12

Tags: apache-spark apache-spark-sql spark-dataframe

I have a DataFrame that looks like this:

stringTokenDF

+-------+------------------
|Id     | Tokens
+-------+------------------
|1      |[A, B, C, D]
|1      |[B, C, D, G]                                                 
|1      |[A, D, E]                                                     
|1      |[B, C, F]                                                
|2      |[A, C, D]
|2      |[C, E, F]
|2      |[A, C, D, H]
+-------+------------------

And another DataFrame that looks like this:

leastFrequenctDf

+-------+------------------
|Id     | LeastFrequentWords
+-------+------------------
|1      |[E, G]
|2      |[E, F, H]
+-------+------------------

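Now, for each row of stringTokenDF, I want to remove from the sequence in the Tokens column every string that appears in the LeastFrequentWords sequence for the same Id. My output should look like this: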

+-------+------------------
|Id     | Tokens
+-------+------------------
|1      |[A, B, C, D]
|1      |[B, C, D]
|1      |[A, D]
|1      |[B, C, F]
|2      |[A, C, D]
|2      |[C]
|2      |[A, C, D]
+-------+------------------

I tried joining the DataFrames and intersecting the sequences, but that does not give me the right result.

// Attempt from the question: intersect keeps only the tokens common to both
// sequences, which is the opposite of removing them.
val intersectorUDF = udf((seq1: Seq[String], seq2: Seq[String]) => {
    seq1.intersect(seq2)
})

stringTokenDF.join(leastFrequenctDf, stringTokenDF("id") === leastFrequenctDf("id")).
  withColumn("intersectedToken", intersectorUDF(stringTokenDF("Tokens"),
    leastFrequenctDf("LeastFrequentWords")))

What is the correct way to achieve this in Spark with Scala?

1 Answer:

Answer 0: (score: 2)

You can join the two DataFrames and apply a UDF that computes the diff between the two sequence columns:

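// Assumes a SparkSession named `spark` is in scope, as in spark-shell.
import org.apache.spark.sql.functions.udf
import spark.implicits._

val stringTokenDF = Seq(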
  (1, Seq("A", "B", "C", "D")),
  (1, Seq("B", "C", "D", "G")),
  (1, Seq("A", "D", "E")),
  (1, Seq("B", "C", "F")),
  (2, Seq("A", "C", "D")),
  (2, Seq("C", "E", "F")),
  (2, Seq("A", "C", "D", "H"))
).toDF("Id", "Tokens")

val leastFrequenctDf = Seq(
  (1, Seq("E", "G")),
  (2, Seq("E", "F", "H"))
).toDF("Id", "LeastFrequentWords")

// A val builds the UDF once; a def would build a new one on every reference.
val diff = udf( (s1: Seq[String], s2: Seq[String]) =>
  s1 diff s2
)

stringTokenDF.join(leastFrequenctDf, Seq("Id")).
  select($"Id", diff($"Tokens", $"LeastFrequentWords").as("Tokens")).
  show

// +---+------------+
// | Id|      Tokens|
// +---+------------+
// |  1|[A, B, C, D]|
// |  1|   [B, C, D]|
// |  1|      [A, D]|
// |  1|   [B, C, F]|
// |  2|   [A, C, D]|
// |  2|         [C]|
// |  2|   [A, C, D]|
// +---+------------+
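
One detail worth checking before relying on this: Scala's `diff` is a multiset difference, so it removes only as many occurrences of a token as appear in the second sequence. The sample data has no repeated tokens, so it makes no difference there. The sketch below uses plain Scala collections (no Spark) to contrast `intersect`, `diff`, and a `filterNot`-based alternative. The object name `SeqRemovalSketch`, the helper `removeAll`, and the `repeated` sequence are made up for illustration and are not part of the original question or answer.

```scala
// Plain Scala collections only; no Spark is needed for this illustration.
object SeqRemovalSketch {

  // Hypothetical helper: drops every token that appears in `unwanted`,
  // no matter how many times it occurs in `tokens`.
  def removeAll(tokens: Seq[String], unwanted: Seq[String]): Seq[String] = {
    val unwantedSet = unwanted.toSet
    tokens.filterNot(unwantedSet.contains)
  }

  def main(args: Array[String]): Unit = {
    val tokens = Seq("B", "C", "D", "G")
    val leastFrequent = Seq("E", "G")

    // intersect keeps only the shared elements
    println(tokens.intersect(leastFrequent)) // List(G)

    // diff removes the shared elements
    println(tokens.diff(leastFrequent)) // List(B, C, D)

    // With a repeated token, diff removes only one occurrence of "G",
    // because "G" appears once in leastFrequent.
    val repeated = Seq("A", "G", "B", "G")
    println(repeated.diff(leastFrequent)) // List(A, B, G)

    // filterNot removes every occurrence
    println(removeAll(repeated, leastFrequent)) // List(A, B)
  }
}
```

If the Tokens column can contain repeated tokens, replacing `s1 diff s2` inside the UDF with a `filterNot` like the one above may be closer to the intent. On Spark 2.4 and later, the built-in `array_except` function may also make the UDF unnecessary, though it removes duplicates from its result.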