Why doesn't a udf call work inside a DataFrame select?

Time: 2018-11-06 07:38:08

Tags: scala apache-spark

I have a sample DataFrame as follows:

val df = Seq((Seq("abc", "cde"), 19, "red, abc"), (Seq("eefg", "efa", "efb"), 192, "efg, efz efz")).toDF("names", "age", "color")

and the following user-defined function, which replaces each string in the "color" column of df with its length:

def strLength(inputString: String): Long = inputString.size.toLong

I save a reference to the udf in a val, for performance, as follows:

val strLengthUdf = udf(strLength _)

When I apply the udf in a select with no other column names, it works:

val x = df.select(strLengthUdf(df("color")))

scala> x.show
+----------+
|UDF(color)|
+----------+
|         8|
|        12|
+----------+

But when I try to select another column along with the udf-processed column, I get the following error:

scala> val x = df.select("age", strLengthUdf(df("color")))
<console>:27: error: overloaded method value select with alternatives:
  [U1, U2](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1], c2: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U2])org.apache.spark.sql.Dataset[(U1, U2)] <and>
  (col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
 cannot be applied to (String, org.apache.spark.sql.Column)
       val x = df.select("age", strLengthUdf(df("color")))
                  ^

What am I missing here?

1 Answer:

Answer 0: (score: 4)

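You cannot mix Strings and Columns in a single select call. As the error message shows, select is overloaded to take either only column names (String) or only Column objects; there is no overload that accepts (String, Column).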

This will work:

df.select(df("age"), strLengthUdf(df("color")))
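
For completeness, here is a minimal standalone sketch of the same fix using col(...) from org.apache.spark.sql.functions instead of df(...). The object name SelectWithUdf, the helper ageAndColorLength, the alias color_length, and the local[1] session are assumptions made for illustration and are not part of the original question; in spark-shell the session already exists as spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object SelectWithUdf {
  // Same logic as strLength in the question.
  def strLength(inputString: String): Long = inputString.length.toLong

  // Hypothetical helper: rebuilds the question's sample data and selects
  // "age" together with the udf-processed "color" column.
  def ageAndColorLength(spark: SparkSession): DataFrame = {
    import spark.implicits._ // needed for toDF on a local Seq

    val df = Seq(
      (Seq("abc", "cde"), 19, "red, abc"),
      (Seq("eefg", "efa", "efb"), 192, "efg, efz efz")
    ).toDF("names", "age", "color")

    val strLengthUdf = udf(strLength _)

    // Every argument is a Column, so the select(cols: Column*) overload applies.
    // The alias "color_length" is an illustrative name, not from the question.
    df.select(col("age"), strLengthUdf(col("color")).as("color_length"))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("select-with-udf")
      .getOrCreate()

    ageAndColorLength(spark).show()
    spark.stop()
  }
}
```

With spark.implicits._ in scope, $"age" should work in place of col("age") as well. The point is only that every argument to select is a Column rather than a mix of String and Column.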