Consider the following Spark 2.1 code:
val df = Seq("Raphael").toDF("name")
df.show()
+-------+
| name|
+-------+
|Raphael|
+-------+
val squareUDF = udf((d:Double) => Math.pow(d,2))
df.select(squareUDF($"name")).show
+---------+
|UDF(name)|
+---------+
| null|
+---------+
Why do I get null? I expected something like a ClassCastException instead, because I am trying to map a String onto a Scala Double:
"Raphael".asInstanceOf[Double]
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double
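The difference between the two failure modes can be sketched in plain Scala (no Spark needed). This is a minimal illustration, not Spark's actual code path: `sqlStyleCast` merely stands in for the SQL cast semantics the answer below describes, where a failed conversion yields NULL rather than an exception.

```scala
// Contrast the JVM cast the question expected with SQL-style cast semantics.
object CastContrast {
  // Stand-in for a SQL cast: failure yields None (i.e. NULL), not an exception.
  def sqlStyleCast(s: String): Option[Double] =
    try Some(s.toDouble) catch { case _: NumberFormatException => None }

  def main(args: Array[String]): Unit = {
    // The JVM cast really does throw, as the question expected:
    val jvmResult =
      try Right("Raphael".asInstanceOf[Double])
      catch { case e: ClassCastException => Left(e.getClass.getSimpleName) }
    println(jvmResult)                // Left(ClassCastException)

    // The SQL-style cast swallows the failure instead:
    println(sqlStyleCast("Raphael"))  // None
    println(sqlStyleCast("2.0"))      // Some(2.0)
  }
}
```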
Answer 0 (score: 3)
This is easy to figure out if you check the execution plan:
scala> df.select(squareUDF($"name")).explain(true)
== Parsed Logical Plan ==
'Project [UDF('name) AS UDF(name)#51]
+- AnalysisBarrier Project [value#36 AS name#38]
== Analyzed Logical Plan ==
UDF(name): double
Project [if (isnull(cast(name#38 as double))) null else UDF(cast(name#38 as double)) AS UDF(name)#51]
+- Project [value#36 AS name#38]
+- LocalRelation [value#36]
== Optimized Logical Plan ==
LocalRelation [UDF(name)#51]
== Physical Plan ==
LocalTableScan [UDF(name)#51]
As you can see, Spark performs type casting before applying the UDF:
UDF(cast(name#38 as double))
and SQL casts don't throw exceptions for type-compatible casts. If the actual cast is not possible, the value is undefined (NULL). If the types were incompatible:
Seq((1, ("Raphael", 42))).toDF("id", "name_number").select(squareUDF($"name_number"))
// org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(name_number)' due to data type mismatch: argument 1 requires double type, however, '`name_number`' is of struct<_1:string,_2:int> type.;;
//
// at org.apache...
you would get an exception.
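You can also make the NULL visible without the UDF in the way by applying the same cast explicitly. This is a sketch for a Spark shell session, assuming the df and squareUDF defined in the question; it requires an active SparkSession and is not standalone.

```scala
// Sketch, assuming the df ("name" column) from the question in a Spark shell.
// The explicit cast reproduces what the analyzer inserted automatically:
df.select($"name".cast("double").as("as_double")).show
// +---------+
// |as_double|
// +---------+
// |     null|
// +---------+
```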
The rest is covered by:
if (isnull(cast(name#38 as double))) null
Since the value is null, the udf is never called.
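That short-circuit can be mimicked in plain Scala. A minimal sketch, not Spark's actual machinery: `parseDouble` models the cast, and the `udfCalls` counter (introduced here for illustration) shows that the UDF body never runs for a row whose cast produced NULL.

```scala
// Models the guard Spark generates around the UDF:
//   if (isnull(cast(name as double))) null else UDF(cast(name as double))
object NullGuardSketch {
  var udfCalls = 0 // counts how often the UDF body actually runs

  def parseDouble(s: String): Option[Double] =
    try Some(s.toDouble) catch { case _: NumberFormatException => None }

  def square(d: Double): Double = { udfCalls += 1; math.pow(d, 2) }

  // None plays the role of NULL: square is skipped entirely for it.
  def guarded(s: String): Option[Double] = parseDouble(s).map(square)

  def main(args: Array[String]): Unit = {
    println(guarded("Raphael")) // None: square was never invoked
    println(guarded("3.0"))     // Some(9.0)
    println(udfCalls)           // 1: only the row that survived the cast
  }
}
```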