Converting DataFrame entries into a case class with an Any-typed member

Time: 2018-05-11 01:00:49

Tags: scala apache-spark spark-dataframe apache-spark-dataset

I have a DataFrame with columns of various types. For clarity, let's say it is structured as below, with a column of Ints, a column of Strings, and a column of Floats:

+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
|      1|      a|    0.1|
|      2|      b|    0.2|
|      3|      c|    0.3|
+-------+-------+-------+
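
(For reference, a minimal way to build such a frame for experimenting, assuming a SparkSession in scope as spark:)

import spark.implicits._

val df = Seq((1, "a", 0.1f), (2, "b", 0.2f), (3, "c", 0.3f))
  .toDF("column1", "column2", "column3")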

I am trying to apply a UDF to all three of these columns so that each entry is turned into a case class like the following:

case class Annotation(lastUpdate: String, value: Any)

by applying this code:

import org.apache.spark.sql.functions.{col, udf}

val columns = df.columns
val myUDF = udf { in: Any => Annotation("dummy", in) }
val finalDF = columns.foldLeft(df) { (tempDF, colName) =>
    tempDF.withColumn(colName, myUDF(col(colName)))
}

Note that on this first pass I don't care what the Annotation.lastUpdate value ends up being. However, when I try to run it, I get the following error:

Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type scala.Any is not supported
    at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:762)
    at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:704)
    at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
    at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:809)
    at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
    at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:703)
    at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1$$anonfun$apply$6.apply(ScalaReflection.scala:758)
    at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1$$anonfun$apply$6.apply(ScalaReflection.scala:757)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.immutable.List.foreach(List.scala:381)
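
As far as I can tell, the exception comes from Spark trying to derive a Catalyst schema for the UDF's return type: Annotation has a member of type scala.Any, and ScalaReflection has no SQL type to map Any to. A sketch of the obvious fallback, giving the case class a concrete member type (a hypothetical StringAnnotation) and casting every column to string first, does run, but it loses the original types:

import org.apache.spark.sql.functions.{col, udf}

// With a concrete value type Spark can derive the struct schema
// <lastUpdate: string, value: string> for the UDF's return type.
case class StringAnnotation(lastUpdate: String, value: String)

val stringUDF = udf { in: String => StringAnnotation("dummy", in) }

val workaroundDF = df.columns.foldLeft(df) { (tempDF, colName) =>
  tempDF.withColumn(colName, stringUDF(col(colName).cast("string")))
}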

I have been looking at custom encoders as a way around this, but I am not sure how one would be applied in this situation.
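
The closest thing I have found so far is a kryo-based encoder. A sketch of that direction (Encoders.kryo is the only custom-encoder mechanism I know of here; the names are mine), although it does not seem to fit the per-column UDF case:

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.col

// Kryo serializes the whole Annotation, Any member included,
// into one opaque binary column, so no schema for Any is ever needed.
val annotationEncoder = Encoders.kryo[Annotation]

// It can back Dataset operations such as map...
val annotated = df.select(col("column1"))
  .as[Int](Encoders.scalaInt)
  .map(v => Annotation("dummy", v))(annotationEncoder)

// ...but the resulting schema is a single binary field, and a kryo encoder
// cannot be supplied as the return type of udf(), so it does not produce
// a queryable struct per column the way the foldLeft above intended.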

0 Answers:

No answers