Does Spark not check UDF / column types?

Date: 2018-02-12 15:48:15

Tags: scala apache-spark

Consider the following Spark 2.1 code:

val df = Seq("Raphael").toDF("name")
df.show()    

+-------+
|   name|
+-------+
|Raphael|
+-------+

val squareUDF = udf((d:Double) => Math.pow(d,2))

df.select(squareUDF($"name")).show

+---------+
|UDF(name)|
+---------+
|     null|
+---------+

Why do I get null? I was expecting something like a ClassCastException, since I'm trying to apply a function over a Scala Double to a String:

"Raphael".asInstanceOf[Double]

java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double

1 answer:

Answer 0 (score: 3):

It's easy to figure out if you inspect the execution plan:

scala> df.select(squareUDF($"name")).explain(true)
== Parsed Logical Plan ==
'Project [UDF('name) AS UDF(name)#51]
+- AnalysisBarrier Project [value#36 AS name#38]

== Analyzed Logical Plan ==
UDF(name): double
Project [if (isnull(cast(name#38 as double))) null else UDF(cast(name#38 as double)) AS UDF(name)#51]
+- Project [value#36 AS name#38]
   +- LocalRelation [value#36]

== Optimized Logical Plan ==
LocalRelation [UDF(name)#51]

== Physical Plan ==
LocalTableScan [UDF(name)#51]

As you can see, Spark inserts a type cast before applying the UDF:

UDF(cast(name#38 as double))

and SQL casts don't throw exceptions for type-compatible casts. If the actual cast cannot be performed, the value is undefined (NULL). With incompatible types:

Seq((1, ("Raphael", 42))).toDF("id", "name_number").select(squareUDF($"name_number"))
// org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(name_number)' due to data type mismatch: argument 1 requires double type, however, '`name_number`' is of struct<_1:string,_2:int> type.;;
// 
// at org.apache...

you get an exception instead.

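The same silent coercion can be reproduced directly with an explicit cast. A minimal sketch, assuming a hypothetical local SparkSession (the builder settings here are illustrative, not part of the original question):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session, for illustration only
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cast-demo")
  .getOrCreate()
import spark.implicits._

val df = Seq("Raphael").toDF("name")

// A non-numeric string cast to double becomes NULL rather than throwing,
// mirroring what the analyzer injected around the UDF:
val casted = df.select($"name".cast("double"))
casted.show()

// A numeric string casts cleanly, so a UDF over double would receive a real value:
Seq("2.0").toDF("name").select($"name".cast("double")).show()
```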

The rest of the expression:

if (isnull(cast(name#38 as double))) null 

Since the value is null, the UDF is never invoked.
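If the silent NULL is undesirable, one workaround (a sketch, assuming the same session as above; `safeSquare` is a hypothetical name, not from the original answer) is to declare the UDF over the column's actual type and make the parse explicit:

```scala
import org.apache.spark.sql.functions.udf
import scala.util.Try

// Take the column as a String and decide explicitly what a non-numeric
// value should produce (here: None, i.e. NULL in the result column):
val safeSquare = udf { (s: String) =>
  Option(s).flatMap(v => Try(v.toDouble).toOption).map(d => Math.pow(d, 2))
}

df.select(safeSquare($"name")).show()

// Alternatively, a typed Dataset moves the mismatch to compile time:
// Seq("Raphael").toDS().map((d: Double) => Math.pow(d, 2))  // does not compile
```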