How do I add a column to a DataFrame using a non-column value in a UserDefinedFunction (UDF)?

Time: 2018-06-29 18:48:39

Tags: scala apache-spark dataframe user-defined-functions

I have a simple DataFrame that I want to manipulate:

+---+----+
| id|name|
+---+----+
|  1|   a|
|  2|   b|
|  3|   c|
|  4|   d|
|  5|   e|
+---+----+

I am trying to add another column based on the "id" column and a value that I pass in when I call withColumn() (in this case the string "hey").

According to other StackOverflow posts (Adding a new column to a Dataframe by using the values of multiple other columns in the dataframe - spark/scala), I should be able to do this with a UDF (UserDefinedFunction), but the UDF call in the code below gets a "not applicable" error in IntelliJ:

val table = Seq(("1", "a"), ("2", "b"), ("3", "c"), ("4", "d"), ("5", "e")).toDF("id", "name")

def newID(s: String, v: String): String = {
  s.concat("-" + v)
}

val newUDF = udf(newID _)

table.show()
val v = "hey"
// This call fails with the error described above. It works if I pass the
// column "name" instead of the String v, i.e. newUDF($"id", $"name").
val newO = table.withColumn("someOp", newUDF($"id", v))

newO.show()

So I can get:

+---+----+------+
| id|name|someOp|
+---+----+------+
|  1|   a|   1-a|
|  2|   b|   2-b|
|  3|   c|   3-c|
|  4|   d|   4-d|
|  5|   e|   5-e|
+---+----+------+

But not:

+---+----+--------+
| id|name|  someOp|
+---+----+--------+
|  1|   a|   1-hey|
|  2|   b|   2-hey|
|  3|   c|   3-hey|
|  4|   d|   4-hey|
|  5|   e|   5-hey|
+---+----+--------+
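The error most likely arises because a Spark `UserDefinedFunction` is applied to `Column` arguments, while `v` is a plain `String`. A minimal sketch of one common workaround is to wrap the value with `lit` from `org.apache.spark.sql.functions`, which turns it into a literal `Column`. The object name `LitSketch` and the local `SparkSession` settings are assumptions for illustration and do not come from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, udf}

object LitSketch {
  // Same helper as in the question.
  def newID(s: String, v: String): String = s.concat("-" + v)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption);
    // spark-shell already provides `spark`.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lit-sketch")
      .getOrCreate()
    import spark.implicits._

    val table = Seq(("1", "a"), ("2", "b"), ("3", "c"), ("4", "d"), ("5", "e"))
      .toDF("id", "name")
    val newUDF = udf(newID _)

    val v = "hey"
    // lit(v) wraps the plain String in a literal Column,
    // so both UDF arguments are now Columns.
    val newO = table.withColumn("someOp", newUDF($"id", lit(v)))
    newO.show()

    spark.stop()
  }
}
```

For plain string concatenation like this, the built-in `concat_ws("-", $"id", lit(v))` would probably avoid the UDF entirely.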

0 answers:

No answers
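A possible way to pass a non-column value such as `v` into a UDF is to capture it in a closure, so that the UDF itself only takes the "id" column. The sketch below assumes a hypothetical factory named `newIDWith` and a local `SparkSession`; neither appears in the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

object ClosureSketch {
  // Hypothetical factory: returns a one-argument UDF that
  // captures the plain String v in its closure.
  def newIDWith(v: String): UserDefinedFunction =
    udf((s: String) => s.concat("-" + v))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("closure-sketch")
      .getOrCreate()
    import spark.implicits._

    val table = Seq(("1", "a"), ("2", "b"), ("3", "c"), ("4", "d"), ("5", "e"))
      .toDF("id", "name")

    // Only a Column is passed to the UDF; "hey" is baked in by the factory.
    val newO = table.withColumn("someOp", newIDWith("hey")($"id"))
    newO.show()

    spark.stop()
  }
}
```

The trade-off is that a new UDF is built for each distinct value, which is fine for a single constant but less convenient if the value changes often.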