Question

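I need to implement a UDF with spark.sql.functions.udf to do some complex filtering.

I have found some examples, but most of them are simple and implemented as closures, and controlling the return value inside a closure is not straightforward.

Here is an example: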

val filterClosure: UserDefinedFunction = udf {
(ips: mutable.WrappedArray[String]) =>
  for (ip <- ips) {
    if (!(ip.startsWith("abc") || ip.startsWith("def"))) true
  }
  false
}

val ds = Seq((0, Array("hello", "baby", "word")), (1, Array("abcgod", "deftest"))).toDF("id", "words")
ds.filter(filterClosure($"words")).show()

The output is:

+---+-----+
| id|words|
+---+-----+
+---+-----+

So, how can I implement this as a function?

Answer 1

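The problem is in your function: it always returns false, because false is its last expression. The value produced by the if inside the loop is discarded, so the loop has no effect. You worked around this with the named method filterFunction and a return inside the loop. However, return is discouraged in Scala, and the collections API offers many ways to work with collections. So why not use the exists method?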

val ds = Seq((0, Array("hello", "baby", "word")), (1, Array("abcgod", "deftest"))).toDF("id", "words")
val filterClosure = udf {
    (ips: scala.collection.mutable.WrappedArray[String]) => ips.exists(ip => !(ip.startsWith("abc") || ip.startsWith("def")))
}

ds.filter(filterClosure($"words")).show()
+---+-------------------+
| id|              words|
+---+-------------------+
|  0|[hello, baby, word]|
+---+-------------------+

That is the result.

It is strongly recommended to use the methods that the Scala Collections API already provides rather than writing your own.
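
To see why the loop from the question always yields false while exists does not, here is a minimal sketch in plain Scala, with no Spark involved. The object name PredicateSketch and the helper lacksPrefix are hypothetical names introduced only for this illustration.

```scala
object PredicateSketch {
  // Hypothetical helper: true when the string starts with neither "abc" nor "def".
  private def lacksPrefix(ip: String): Boolean =
    !(ip.startsWith("abc") || ip.startsWith("def"))

  // Mirrors the closure from the question: the value of the `if` is discarded
  // by the loop, so the method always falls through to the final `false`.
  def loopVersion(ips: Seq[String]): Boolean = {
    for (ip <- ips) {
      if (lacksPrefix(ip)) true
    }
    false
  }

  // Collections version: true as soon as one element lacks both prefixes.
  def existsVersion(ips: Seq[String]): Boolean = ips.exists(lacksPrefix)

  def main(args: Array[String]): Unit = {
    val first = Seq("hello", "baby", "word")
    val second = Seq("abcgod", "deftest")
    println(loopVersion(first))    // false
    println(loopVersion(second))   // false
    println(existsVersion(first))  // true
    println(existsVersion(second)) // false
  }
}
```

The compiler may warn that the `true` inside the loop is a pure expression that does nothing, which is a useful hint that the result is being thrown away.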

Answer 2

Wrap an already defined function with udf:

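import scala.collection.mutable
import org.apache.spark.sql.functions.udf

// Returns true as soon as one word starts with neither "abc" nor "def".
def filterFunction(words: mutable.WrappedArray[String]): Boolean = {
  for (wd <- words) {
    if (!(wd.startsWith("abc") || wd.startsWith("def"))) return true
  }
  false
}
val filterUdf = udf(filterFunction _)
ds.filter(filterUdf($"words")).show()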

The output is correct:

+---+-------------------+
| id|              words|
+---+-------------------+
|  0|[hello, baby, word]|
+---+-------------------+
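
The two answers can be combined: a named method that uses exists instead of return, wrapped with udf. The sketch below is one possible arrangement, not the only one. It assumes a local SparkSession on Scala 2.12 or 2.13; the names FilterUdfSketch, hasOtherPrefix, has_other_prefix, and the view name ips are hypothetical. Declaring the parameter as Seq[String] rather than WrappedArray is an assumption made here for portability across Scala versions, and the null check guards against rows where the array column is null.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object FilterUdfSketch {
  // Named method without `return`: true when at least one word starts
  // with neither "abc" nor "def"; a null array is treated as no match.
  def hasOtherPrefix(words: Seq[String]): Boolean =
    words != null && words.exists(w => !(w.startsWith("abc") || w.startsWith("def")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(
      (0, Array("hello", "baby", "word")),
      (1, Array("abcgod", "deftest"))
    ).toDF("id", "words")

    // Wrap the method for use in the DataFrame API.
    val filterUdf = udf(hasOtherPrefix _)
    ds.filter(filterUdf($"words")).show()

    // Register the same method under a name so SQL queries can call it.
    spark.udf.register("has_other_prefix", hasOtherPrefix _)
    ds.createOrReplaceTempView("ips")
    spark.sql("SELECT * FROM ips WHERE has_other_prefix(words)").show()

    spark.stop()
  }
}
```

Both calls to show() should keep only the row with id 0, matching the output above.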
