Spark not serializable problem

Date: 2017-02-13 17:04:27

Tags: scala apache-spark serialization dependency-injection

I am working on refactoring our code so that we can use the CAKE pattern for dependency injection.

I've stumbled upon a serialization issue that I am having trouble understanding.

When I call this function:

def getAccounts(winZones: Broadcast[List[WindowsZones]]): RDD[AccountDetails] = {
  val accounts = getAccounts //call to db

  val result = accounts.map(row =>
    Some(AccountDetails(UUID.fromString(row.getAs[String]("")),
      row.getAs[String](""),
      UUID.fromString(row.getAs[String]("")),
      row.getAs[String](""),
      row.getAs[String](""),
      DateUtils.getIanaZoneFromWinZone(row.getAs[String]("timeZone"), winZones))))
    .map(m => m.get)
  result
}

It works perfectly, but it is ugly, and I want to refactor it so that the intermediate mapping from Row to AccountDetails lives in a private function. Doing that, however, causes the serialization problem.

What I want is:

def getAccounts(winZones: Broadcast[List[WindowsZones]]): RDD[AccountDetails] = {
  val accounts = getAccounts 

  val result = accounts
    .map(m => getAccountDetails(m, winZones))
    .filter(_.isDefined)
    .map(m => m.get)
  result
}

private def getAccountDetails(row: Row, winZones: Broadcast[List[WindowsZones]]): Option[AccountDetails] = {
  try {
    Some(AccountDetails(UUID.fromString(row.getAs[String]("")),
      row.getAs[String](""),
      UUID.fromString(row.getAs[String]("")),
      row.getAs[String](""),
      row.getAs[String](""),
      DateUtils.getIanaZoneFromWinZone(row.getAs[String]("timeZone"), winZones)))
  }
  catch {
    case e: Exception =>
      logger.error(s"Unable to set AccountDetails $e")
      None
  }
}

Any help is of course appreciated; the AccountDetails obj is a case class. I'm also happy to take any other suggestions for implementing Cake or DI with Spark in general. Thanks.

Edited to show the structure:

trait serviceImpl extends anotherComponent { this: DBA =>
  def Accounts = new Accounts
  class Accounts extends AccountService {
    // the methods above are defined here.
  }
}

Edited to include the stack trace:

17/02/13 17:32:32 INFO CodeGenerator: Code generated in 271.36617 ms
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2039)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:366)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:365)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.RDD.map(RDD.scala:365)
    at FunnelServiceComponentImpl$FunnelAccounts.getAccounts(FunnelServiceComponentImpl.scala:24)
    at Main$.delayedEndpoint$Main$1(Main.scala:26)
    at Main$delayedInit$body.apply(Main.scala:7)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at Main$.main(Main.scala:7)
    at Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.io.NotSerializableException: FunnelServiceComponentImpl$FunnelAccounts
Serialization stack:
    - object not serializable (class: FunnelServiceComponentImpl$FunnelAccounts, value: FunnelServiceComponentImpl$FunnelAccounts@16b7e04a)
    - field (class: FunnelServiceComponentImpl$FunnelAccounts$$anonfun$1, name: $outer, type: class FunnelServiceComponentImpl$FunnelAccounts)
    - object (class FunnelServiceComponentImpl$FunnelAccounts$$anonfun$1, <function1>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
    ... 26 more
17/02/13 17:32:32 INFO SparkContext: Invoking stop() from shutdown hook

2 Answers:

Answer 0 (score: 2):

Where are you defining your functions?

Let's say you are defining them in a class X. If that class is not serializable, it will cause your problem.

To solve this, you can either turn it into an object or make the class serializable.
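
A minimal sketch of both options, assuming the function is defined in a class X as above (X and the stubbed-out method bodies are placeholders, not the asker's actual code):

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.Row

// Option 1: define the function in an object. Object methods are
// resolved statically, so the closure passed to map() no longer
// captures an enclosing `this` that would need to be serialized.
object X {
  def getAccountDetails(row: Row, winZones: Broadcast[List[WindowsZones]]): Option[AccountDetails] = ???
}

// Option 2: keep the class but mark it Serializable, so the captured
// `this` reference can be shipped to the executors. Note this serializes
// the whole instance, including every field it holds.
class X extends Serializable {
  def getAccountDetails(row: Row, winZones: Broadcast[List[WindowsZones]]): Option[AccountDetails] = ???
}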

Answer 1 (score: 1):

Since getAccountDetails lives in your class, Spark will want to serialize your entire FunnelAccounts object. After all, an instance is needed in order to use that method. However, FunnelAccounts is not serializable, so it cannot be sent to the workers.

In your case, you should move getAccountDetails into a FunnelAccounts object (rather than a class), so that no instance of FunnelAccounts is needed to run it.
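
A minimal sketch of that fix, reusing the code from the question; the object name FunnelAccountMappers is made up for illustration:

import java.util.UUID
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.Row

// A standalone object: the closure built in getAccounts will now only
// capture the broadcast variable, not a FunnelAccounts instance.
object FunnelAccountMappers {
  def getAccountDetails(row: Row, winZones: Broadcast[List[WindowsZones]]): Option[AccountDetails] =
    try {
      Some(AccountDetails(UUID.fromString(row.getAs[String]("")),
        row.getAs[String](""),
        UUID.fromString(row.getAs[String]("")),
        row.getAs[String](""),
        row.getAs[String](""),
        DateUtils.getIanaZoneFromWinZone(row.getAs[String]("timeZone"), winZones)))
    } catch {
      case e: Exception => None // log the failure here, as in the original
    }
}

The pipeline inside FunnelAccounts stays the same; only the call target changes:

val result = accounts
  .map(m => FunnelAccountMappers.getAccountDetails(m, winZones))
  .filter(_.isDefined)
  .map(_.get)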