Spark Dataset of Map[String, java.io.Serializable]

Date: 2018-11-15 23:12:03

Tags: apache-spark

I'm writing a Spark job whose dataset is very flexible, defined as Dataset[Map[String, java.io.Serializable]].

The problem is that the Spark runtime complains with No Encoder found for java.io.Serializable. I tried Kryo serde, but it still shows the same error message.

The reason I have to use this odd dataset type is that each row has a flexible set of fields. A map looks like:

Map(
  "a" -> 1,
  "b" -> "bbb",
  "c" -> 0.1,
  ...
)

Is there any way in Spark to handle this kind of flexible dataset type?
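(One possible workaround, sketched here as a suggestion rather than part of the original post: if every value can be round-tripped through a String, Spark provides a built-in implicit encoder for Map[String, String], so no custom encoder is needed at all. The stringified values would have to be parsed back after reading.)

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Sketch only: assumes all values can be losslessly represented as strings.
object StringMapSketch extends App {
  val spark = SparkSession.builder().master("local[2]").getOrCreate()
  import spark.implicits._

  // Map[String, String] is covered by Spark's built-in encoders,
  // so this compiles and runs without Kryo.
  val ds: Dataset[Map[String, String]] = Seq(
    Map("a" -> "1", "b" -> "bbb", "c" -> "0.1")
  ).toDS()

  ds.show(truncate = false)
}
```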

EDIT: here is reproducible code that anyone can try.

import org.apache.spark.sql.{Dataset, SparkSession}

object SerdeTest extends App {
  val sparkSession: SparkSession = SparkSession
    .builder()
    .master("local[2]")
    .getOrCreate()


  import sparkSession.implicits._
  val ret: Dataset[Record] = sparkSession.sparkContext.parallelize(0 to 10)
    .map(
      t => {
        val row = (0 to t).map(
          i => i -> i.asInstanceOf[Integer]
        ).toMap

        Record(map = row)
      }
    ).toDS()

  val repartitioned = ret.repartition(10)


  repartitioned.collect.foreach(println)
}

case class Record(map: Map[Int, java.io.Serializable])

The code above will give you the "No Encoder found" error:

Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for java.io.Serializable
- map value class: "java.io.Serializable"
- field (class: "scala.collection.immutable.Map", name: "map")

1 Answer:

Answer 0 (score: 0)

Found the answer. One way to solve this is to use the Kryo serde framework. The code change is minimal: just define an implicit Encoder backed by Kryo and bring it into scope wherever serialization is needed.

Here is the code sample I'm using (it can be run directly in IntelliJ or an equivalent IDE):

import org.apache.spark.sql._

object SerdeTest extends App {
  val sparkSession: SparkSession = SparkSession
    .builder()
    .master("local[2]")
    .getOrCreate()


  import sparkSession.implicits._

  // Define an implicit Encoder for the custom type, in this case
  // Record, which wraps a Map[Int, java.io.Serializable].
  implicit val myObjEncoder: Encoder[Record] = org.apache.spark.sql.Encoders.kryo[Record]
  val ret: Dataset[Record] = sparkSession.sparkContext.parallelize(0 to 10)
    .map(
      t => {
        val row = (0 to t).map(
          i => i -> i.asInstanceOf[Integer]
        ).toMap

        Record(map = row)
      }
    ).toDS()

  val repartitioned = ret.repartition(10)


  repartitioned.collect.foreach(
    row => println(row.map)
  )
}

case class Record(map: Map[Int, java.io.Serializable])

This code produces the expected result:

Map(0 -> 0, 5 -> 5, 1 -> 1, 2 -> 2, 3 -> 3, 4 -> 4)
Map(0 -> 0, 1 -> 1, 2 -> 2)
Map(0 -> 0, 5 -> 5, 1 -> 1, 6 -> 6, 2 -> 2, 7 -> 7, 3 -> 3, 4 -> 4)
Map(0 -> 0, 1 -> 1)
Map(0 -> 0, 1 -> 1, 2 -> 2, 3 -> 3, 4 -> 4)
Map(0 -> 0, 1 -> 1, 2 -> 2, 3 -> 3)
Map(0 -> 0)
Map(0 -> 0, 5 -> 5, 1 -> 1, 6 -> 6, 2 -> 2, 3 -> 3, 4 -> 4)
Map(0 -> 0, 5 -> 5, 10 -> 10, 1 -> 1, 6 -> 6, 9 -> 9, 2 -> 2, 7 -> 7, 3 -> 3, 8 -> 8, 4 -> 4)
Map(0 -> 0, 5 -> 5, 1 -> 1, 6 -> 6, 9 -> 9, 2 -> 2, 7 -> 7, 3 -> 3, 8 -> 8, 4 -> 4)
Map(0 -> 0, 5 -> 5, 1 -> 1, 6 -> 6, 2 -> 2, 7 -> 7, 3 -> 3, 8 -> 8, 4 -> 4)
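(A caveat worth noting, my observation rather than part of the original answer: Encoders.kryo serializes each Record into a single opaque binary column, so Catalyst cannot see the map's contents, and column-level operations or pushdown optimizations on the map fields are not available. Adding a printSchema call to the example above makes this visible.)

```scala
// With the implicit Encoders.kryo[Record] in scope, the Dataset's
// schema collapses to one binary column rather than a typed map:
ret.printSchema()
// root
//  |-- value: binary (nullable = true)
```

If query-time access to individual fields matters, a typed schema (or the Map[String, String] approach) may be a better fit than Kryo.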