Spark UDF: how to pass a Map in as a column

Date: 2016-11-14 22:27:12

Tags: scala apache-spark spark-dataframe apache-zeppelin

I am using an Apache Zeppelin notebook, so Spark is effectively running in interactive mode. I cannot use a closure variable here, because Spark throws org.apache.spark.SparkException: Task not serializable when it tries to serialize the whole paragraph (a larger closure).

So, without the closure approach, my only option is to pass the map to the UDF as a column.

I collected the following map from a paired RDD:

final val idxMap = idxMapRdd.collectAsMap

Here is the Spark transformation:

def labelStr(predictions: WrappedArray[Double], idxMap: Map[Double, String]): Array[String] = {
    predictions.array.map(idxMap.getOrElse(_, "Other"))
}
@transient val predictionStrUDF = udf { (predictions: WrappedArray[Double], idxMap: Map[Double, String]) => labelStr(predictions, idxMap) }

val cvmlPredictionsStr = cvmlPrediction.withColumn("predictionsStr", predictionStrUDF(col("predictions"), lit(idxMap))) 
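The per-row lookup that labelStr performs is plain Scala and can be checked in isolation. A minimal sketch, with an illustrative map (the real idxMap comes from collectAsMap above; the label values here are made up):

```scala
// Illustrative stand-in for the collected index-to-label map.
val idxMap: Map[Double, String] = Map(0.0 -> "cat", 1.0 -> "dog")

// Same lookup as labelStr: map each prediction index to its label,
// falling back to "Other" for unknown indices.
def labelStr(predictions: Seq[Double], idxMap: Map[Double, String]): Array[String] =
  predictions.map(idxMap.getOrElse(_, "Other")).toArray
```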

But with the lit(idxMap) expression I get the following error:

java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.HashMap$HashTrieMap

So I tried creating the column with:

val colmap = map(idxMapArr.map(lit _): _*)

but got the following error:

<console>:139: error: type mismatch;
 found   : Iterable[org.apache.spark.sql.Column]
 required: Seq[org.apache.spark.sql.Column]
       val colmap =  map(idxMapArr.map(lit _): _*)
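Two things go wrong in that attempt: Spark's `map` function expects a Seq of alternating key, value Columns, while `idxMapArr.map(lit _)` produces an Iterable and passes each (key, value) tuple to `lit` whole. A hedged sketch of the fix (lit/map are Spark functions, shown only in the comment; the interleaving itself is plain Scala):

```scala
// Sketch of the Spark-side fix, flattening each pair into two lit() columns
// and converting to a Seq before the varargs expansion:
//
//   val colmap = map(idxMapArr.toSeq.flatMap { case (k, v) => Seq(lit(k), lit(v)) }: _*)
//
// The interleaving step can be checked on plain values:
def interleave[K, V](pairs: Seq[(K, V)]): Seq[Any] =
  pairs.flatMap { case (k, v) => Seq(k, v) }
```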

The closure approach (for completeness):

def predictionStrUDF2(idxMapArr: scala.collection.Map[Double, String]) = {
    udf((predictions: WrappedArray[Double]) => labelStr(predictions, idxMapArr))
}
val cvmlPredictionsStr = cvmlPrediction.withColumn("predictionsStr", predictionStrUDF2(idxMapArr)(col("predictions")))

This compiles, but when I call cvmlPredictionsStr.show I get the following. I think it is due to Zeppelin's interactivity.

org.apache.spark.SparkException: Task not serializable
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:2037)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:798)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:797)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
  at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:797)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:364)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:240)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:323)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924)
  at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:526)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:486)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:495)
  ... 62 elided
Caused by: java.io.NotSerializableException: com.github.fommil.netlib.F2jBLAS
Serialization stack:
    - object not serializable (class: com.github.fommil.netlib.F2jBLAS, value: com.github.fommil.netlib.F2jBLAS@294770d3)
    - field (class: org.apache.spark.ml.tuning.CrossValidator, name: f2jBLAS, type: class com.github.fommil.netlib.F2jBLAS)
    - object (class org.apache.spark.ml.tuning.CrossValidator, cv_891fd6b7d95f)
    - field (class: $iw, name: crossValidator, type: class org.apache.spark.ml.tuning.CrossValidator)
    - object (class $iw, $iw@556a6aed)
    - field (class: $iw, name: $iw, type: class $iw)

1 Answer:

Answer 0 (score: 1)

The question title is about Spark UDFs, but it seems the real question here is how to avoid the closure-serialization problem that some interactive environments exhibit.

From your description of the problem, it sounds like the following does not work when executed directly in one of your notebook cells:

val x = 5
sc.parallelize(1 to 10).filter(_ > x).collect()

That is probably because x is a class member of the cell object; when the lambda captures x, it attempts to serialize the entire cell object. The cell object is not serializable, and the result is a messy exception. This problem can be avoided with a wrapper object. Note that this is not necessarily the tidiest way to declare the wrapper (perhaps just nesting inside braces is sufficient).

object Wrapper {
    def f(): Unit = {
        val x = 5
        sc.parallelize(1 to 10).filter(_ > x).collect()
    }
}
Wrapper.f()
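The same failure mode can be reproduced without Spark: Java serialization of a closure that references a field of its enclosing object drags the whole object along, exactly as in the stack trace above (where CrossValidator's non-serializable F2jBLAS field was captured). A minimal sketch, with illustrative class names:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a generated interpreter cell: a serializable
// object that holds a non-serializable member.
class NonSerializableMember // does not implement java.io.Serializable
class Cell extends Serializable {
  val member = new NonSerializableMember
  // Referencing a field makes the lambda capture `this`, so serializing the
  // lambda tries to serialize the whole Cell, member included.
  def badClosure: Int => Boolean = x => member != null && x > 5
  // Copying what is needed into a local first means the lambda captures only
  // that value, not `this` (the same trick the wrapper object enables).
  def goodClosure: Int => Boolean = {
    val threshold = 5
    x => x > threshold
  }
}

// Returns true if obj survives a round through Java serialization.
def serializable(obj: AnyRef): Boolean =
  try {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    out.writeObject(obj)
    out.close()
    true
  } catch { case _: NotSerializableException => false }
```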

You may still have questions after resolving this issue, but as it stands the question covers too many different subtopics. Another explanation of the closure-serialization problem is available here.