Question

I have a DataFrame with two columns, words and numbers: words holds an array of strings and numbers holds an array of integers.

For example:

words: ["hello","there","Everyone"] and numbers: [0,4,5]

I want to get the words whose corresponding integer in numbers is not 0. In the example above, only "there" and "Everyone" should be returned.

I'm still a beginner with Scala and Spark. I tried filter, but how do I reach inside the arrays, and how do I get those words back?

like df.filter(col("numbers") != 0)

Answer 1

You can simply define the following UDF:

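import org.apache.spark.sql.functions.udf

// Pair each word with its number, keep pairs whose number is non-zero, return the words.
val myUDF = udf { (a: Seq[String], b: Seq[Int]) =>
  a.zip(b).filter(_._2 != 0).map(_._1)
}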

It zips the two arrays together and filters the resulting pairs on the integer value.
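To see what the body of the UDF does without a Spark session, here is a minimal sketch on plain Scala collections. The object name ZipFilterSketch and the method name keepNonZero are made up for illustration; only the zip/filter/map chain comes from the answer.

```scala
object ZipFilterSketch {
  // Hypothetical helper mirroring the UDF body: pair each word with the
  // number at the same index, drop pairs whose number is 0, keep the words.
  def keepNonZero(words: Seq[String], numbers: Seq[Int]): Seq[String] =
    words.zip(numbers).filter(_._2 != 0).map(_._1)

  def main(args: Array[String]): Unit = {
    val words = Seq("hello", "there", "Everyone")
    val numbers = Seq(0, 4, 5)

    // Intermediate step: the zipped pairs.
    println(words.zip(numbers)) // List((hello,0), (there,4), (Everyone,5))

    // Final result: only the words whose number is non-zero.
    println(keepNonZero(words, numbers)) // List(there, Everyone)
  }
}
```

Note that zip stops at the end of the shorter sequence, so if the two arrays differ in length, the extra elements are silently ignored.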

df.select(myUDF($"words", $"numbers").as("words")).show

This returns the matching words in an array:

+-----------------+
|            words|
+-----------------+
|[there, Everyone]|
+-----------------+

If you want each word on its own row, you can use explode:

df.select(explode(myUDF($"words", $"numbers")).as("words")).show

This results in:

+--------+
|   words|
+--------+
|   there|
|Everyone|
+--------+
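As a rough analogy for what explode adds on top of the UDF, the sketch below models each DataFrame row as a plain Scala tuple and uses flatMap to turn every surviving word into its own element. This is not Spark code: the names ExplodeSketch, Row and explodeWords are hypothetical, and the second row in the example is invented to show the behaviour across several rows.

```scala
object ExplodeSketch {
  // Hypothetical stand-in for one DataFrame row: (words, numbers).
  type Row = (Seq[String], Seq[Int])

  // flatMap plays the role of explode here: each row yields zero or more
  // words, and the per-row results are flattened into one sequence.
  def explodeWords(rows: Seq[Row]): Seq[String] =
    rows.flatMap { case (words, numbers) =>
      words.zip(numbers).collect { case (word, n) if n != 0 => word }
    }

  def main(args: Array[String]): Unit = {
    val rows: Seq[Row] = Seq(
      (Seq("hello", "there", "Everyone"), Seq(0, 4, 5)),
      (Seq("a", "b"), Seq(1, 0)) // invented second row, for illustration only
    )
    explodeWords(rows).foreach(println) // prints there, Everyone, a on separate lines
  }
}
```

A row whose numbers are all 0 contributes nothing to the result, which matches how explode produces no output rows for an empty array.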