I am trying to pass an entire row, along with a few other arguments, to a Spark UDF. I am not using Spark SQL but the DataFrame withColumn API, and I get the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) col3#9 missing from col1#7,col2#8,col3#13 in operator !Project [col1#7, col2#8, col3#13, UDF(col3#9, col2, named_struct(col1, col1#7, col2, col2#8, col3, col3#9)) AS contcatenated#17]. Attribute(s) with the same name appear in the operation: col3. Please check if the right attribute(s) are used.;;
The exception can be reproduced with the following code:
addRowUDF() // invoke the reproduction

def addRowUDF(): Unit = {
  import org.apache.spark.SparkConf
  import org.apache.spark.sql.SparkSession
  // The configuration key for the master URL is "spark.master"
  val spark = SparkSession.builder().config(new SparkConf().set("spark.master", "local[*]")).appName(this.getClass.getSimpleName).getOrCreate()
  import spark.implicits._
  val df = Seq(
    ("a", "b", "c"),
    ("a1", "b1", "c1")).toDF("col1", "col2", "col3")
  execute(df)
}

def execute(df: org.apache.spark.sql.DataFrame): Unit = {
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.functions.{ lit, struct, udf }
  def concatFunc(x: Any, y: String, row: Row) = x.toString + ":" + y + ":" + row.mkString(", ")
  val combineUdf = udf((x: Any, y: String, row: Row) => concatFunc(x, y, row))
  def udf_execute(udf: String, args: org.apache.spark.sql.Column*) = (combineUdf)(args: _*)
  // Column references bound to the original df
  val columns = df.columns.map(df(_))
  val df2 = df.withColumn("col3", lit("xxxxxxxxxxx"))
  // This is the call that throws the AnalysisException
  val df3 = df2.withColumn("contcatenated", udf_execute("uudf", df2.col("col3"), lit("col2"), struct(columns: _*)))
  df3.show(false)
}
The output should be:
+----+----+-----------+----------------------------+
|col1|col2|col3       |contcatenated               |
+----+----+-----------+----------------------------+
|a   |b   |xxxxxxxxxxx|xxxxxxxxxxx:col2:a, b, c    |
|a1  |b1  |xxxxxxxxxxx|xxxxxxxxxxx:col2:a1, b1, c1 |
+----+----+-----------+----------------------------+
Answer 0 (score: 1)
This happens because you are referring to a column that is no longer in scope. When you call:
val df2 = df.withColumn("col3", lit("xxxxxxxxxxx"))
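it shadows the original col3 column, so the earlier column of that name is no longer accessible through df2. The columns array, however, was built from df and still refers to that original col3, which is the attribute the error reports as missing. Even if that were not the case, say after: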
val df2 = df.select($"*", lit("xxxxxxxxxxx") as "col3")
the new col3 would be ambiguous, indistinguishable by name from the one brought in by *.
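To make the shadowing concrete, here is a minimal sketch intended for spark-shell or a Scala script. The names staleColumns, freshColumns and the "s" column are hypothetical, introduced only for illustration. It assumes, as the exception in the question indicates, that a reference to the replaced col3 fails analysis when used against df2.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, struct}
import scala.util.Try

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("shadowing-sketch") // hypothetical application name
  .getOrCreate()
import spark.implicits._

val df = Seq(("a", "b", "c")).toDF("col1", "col2", "col3")

// Column references captured from df, before col3 is replaced.
val staleColumns = df.columns.map(df(_))

val df2 = df.withColumn("col3", lit("xxxxxxxxxxx"))

// df2 exposes the same column names, but its col3 is a new attribute.
val sameNames = df.columns.sameElements(df2.columns)

// Using the old references against df2 is expected to fail analysis,
// because df's col3 is no longer part of df2's output.
val staleAttempt = Try(df2.withColumn("s", struct(staleColumns: _*)))

// References rebuilt from df2 resolve, but they see the new col3 value,
// not the original one.
val freshAttempt = Try(df2.withColumn("s", struct(freshColumns(df2): _*)))

// Hypothetical helper: rebuild column references from a given DataFrame.
def freshColumns(d: org.apache.spark.sql.DataFrame) = d.columns.map(d(_))
```

Rebuilding the references from df2 avoids the exception, but the struct then carries the replaced value, which is why it does not produce the output asked for in the question.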
So, to get the desired output, you have to use a different name:
val df2 = df.withColumn("col3_", lit("xxxxxxxxxxx"))
and then adjust the rest of the code accordingly:
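df2.withColumn(
  "contcatenated",
  udf_execute("uudf", df2.col("col3_") as "col3",
    lit("col2"), struct(columns: _*))
).drop("col3")                        // remove the original col3
 .withColumnRenamed("col3_", "col3")  // give the new column the expected name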
If the logic is as simple as in the example, you can of course just inline it:
df.withColumn(
  "contcatenated",
  udf_execute("uudf", lit("xxxxxxxxxxx") as "col3",
    lit("col2"), struct(columns: _*))
).withColumn("col3", lit("xxxxxxxxxxx")) // overwrite col3 only after the UDF has read the original