Question

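I have a dataset like this:

+----+------+
|code|status|
+----+------+
|   1| "new"|
|   2|  null|
|   3|  null|
+----+------+

I would like to write a UDF that depends on both columns.

I got it working by following the second approach in this answer, which is to handle null outside the UDF and write myFn to take a Boolean as the second parameter:

df.withColumn("new_column",
    when(pst_regs("status").isNull,
        myFnUdf($"code", lit(false))
    )
    .otherwise(
        myFnUdf($"code", lit(true))
    )
)

To handle null inside the UDF, an approach I looked at is this answer, which talks about "wrapping arguments with Options". I tried code like this:

df.withColumn("new_column", myFnUdf($"code", $"status"))

def myFn(code: Int, status: String) = (code, Option(status)) match {
    case (1, "new") => "1_with_new_status"
    case (2, Some(_)) => "2_with_any_status"
    case (3, None) => "3_no_status"
}

But the row with null gives type mismatch; found :None.type required String. I also tried wrapping an argument with Option during udf creation, without success. The basic form of this (without Option) looks like this:

myFnUdf = udf[String, Int, String](myFn(_:Int, _:String))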

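I'm new to Scala, so I'm sure I'm missing something simple. Part of my confusion may come from the different syntaxes for creating UDFs from functions (e.g. per https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-udfs.html), so I'm not sure I'm using the best method. Any help is appreciated!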

Edit

Edited to add the missing (1, "new") case, per comments from @user6910411 and @sgvd.

Answer 1

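First, you may be using some code that we are missing here. When I make your example myFn into a UDF with val myFnUdf = udf(myFn _) and run it with df.withColumn("new_column", myFnUdf($"code", $"status")).show, I do not get a type mismatch but a MatchError, as user6910411 also noted. This is because there is no pattern that matches (1, "new").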
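To illustrate the MatchError point, here is a minimal sketch of how the question's myFn could be made total. It is not the asker's final code: the wrapping object StatusLabels, the Some("new") pattern, and the "unmatched" fallback label are assumptions added for illustration. It is plain Scala with no Spark dependency, so the matching logic can be checked on its own.

```scala
// Hypothetical helper object; the name is illustrative only.
object StatusLabels {
  // Option(status) turns a null status into None, so every pattern
  // in the second tuple position is an Option[String] pattern.
  def myFn(code: Int, status: String): String = (code, Option(status)) match {
    case (1, Some("new")) => "1_with_new_status"
    case (2, Some(_))     => "2_with_any_status"
    case (3, None)        => "3_no_status"
    // Assumed fallback: without a catch-all case, any other
    // combination (e.g. code 2 with a null status) throws MatchError.
    case _                => "unmatched"
  }
}
```

With Spark on the classpath, this should be wrappable the same way as in the answer, e.g. udf(StatusLabels.myFn _).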

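Beyond that, although it is usually better to use Scala's Option rather than raw null values, you don't have to in this case. The following example works with null directly: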

val my_udf = udf((code: Int, status: String) => status match {
    case null => "no status"
    case _ => "with status"
})

df.withColumn("new_column", my_udf($"code", $"status")).show

Result:

+----+------+-----------+
|code|status| new_column|
+----+------+-----------+
|   1|   new|with status|
|   2|  null|  no status|
|   3|  null|  no status|
+----+------+-----------+

Wrapping with Option still works:

val my_udf = udf((code: Int, status: String) => Option(status) match {
    case None => "no status"
    case Some(_) => "with status"
})

This gives the same result.
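As a side note, the same null handling can be written without an explicit match. The sketch below uses Option.fold from the Scala standard library; the object StatusDescriber and the function name describe are hypothetical names chosen for illustration, and the unused code parameter from the answer's lambda is left out.

```scala
// Hypothetical helper object; the names are illustrative only.
object StatusDescriber {
  // Option(status) is None when status is null, Some(status) otherwise.
  // fold returns the first argument for None and applies the function for Some.
  def describe(status: String): String =
    Option(status).fold("no status")(_ => "with status")
}
```

This is a stylistic alternative only; it is meant to behave the same as the two versions above for null and non-null inputs.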

How to use Option in Spark UDF
