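I am writing a Spark job whose dataset is very flexible. It is defined as Dataset[Map[String, java.io.Serializable]].
The problem is that the Spark runtime complains: No Encoder found for java.io.Serializable. I tried Kryo serde and still got the same error message.
I have to use this odd dataset type because the fields vary from row to row. A map looks like this: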
Map(
"a" -> 1,
"b" -> "bbb",
"c" -> 0.1,
...
)
Is there any way in Spark to handle this kind of flexible dataset type?
Edit: here is a self-contained example that anyone can try.
import org.apache.spark.sql.{Dataset, SparkSession}

object SerdeTest extends App {
  val sparkSession: SparkSession = SparkSession
    .builder()
    .master("local[2]")
    .getOrCreate()

  import sparkSession.implicits._

  val ret: Dataset[Record] = sparkSession.sparkContext.parallelize(0 to 10)
    .map(
      t => {
        val row = (0 to t).map(
          i => i -> i.asInstanceOf[Integer]
        ).toMap
        Record(map = row)
      }
    ).toDS()

  val repartitioned = ret.repartition(10)
  repartitioned.collect.foreach(println)
}

case class Record (
  map: Map[Int, java.io.Serializable]
)
The code above fails with a "No Encoder found" error:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for java.io.Serializable
- map value class: "java.io.Serializable"
- field (class: "scala.collection.immutable.Map", name: "map")
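Answer 0 (score: 0)
I found an answer. One way to solve this is to use the Kryo serde framework. The code change is minimal: create an implicit Encoder with Kryo and bring it into scope wherever the Dataset needs to be serialized.
Here is a sample of the code I am using (it can be run directly in IntelliJ or an equivalent IDE):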
import org.apache.spark.sql._

object SerdeTest extends App {
  val sparkSession: SparkSession = SparkSession
    .builder()
    .master("local[2]")
    .getOrCreate()

  import sparkSession.implicits._

  // here is the place you define your Encoder for your custom object type, like in this case Map[Int, java.io.Serializable]
  implicit val myObjEncoder: Encoder[Record] = org.apache.spark.sql.Encoders.kryo[Record]

  val ret: Dataset[Record] = sparkSession.sparkContext.parallelize(0 to 10)
    .map(
      t => {
        val row = (0 to t).map(
          i => i -> i.asInstanceOf[Integer]
        ).toMap
        Record(map = row)
      }
    ).toDS()

  val repartitioned = ret.repartition(10)
  repartitioned.collect.foreach(
    row => println(row.map)
  )
}

case class Record (
  map: Map[Int, java.io.Serializable]
)
This code produces the expected result:
Map(0 -> 0, 5 -> 5, 1 -> 1, 2 -> 2, 3 -> 3, 4 -> 4)
Map(0 -> 0, 1 -> 1, 2 -> 2)
Map(0 -> 0, 5 -> 5, 1 -> 1, 6 -> 6, 2 -> 2, 7 -> 7, 3 -> 3, 4 -> 4)
Map(0 -> 0, 1 -> 1)
Map(0 -> 0, 1 -> 1, 2 -> 2, 3 -> 3, 4 -> 4)
Map(0 -> 0, 1 -> 1, 2 -> 2, 3 -> 3)
Map(0 -> 0)
Map(0 -> 0, 5 -> 5, 1 -> 1, 6 -> 6, 2 -> 2, 3 -> 3, 4 -> 4)
Map(0 -> 0, 5 -> 5, 10 -> 10, 1 -> 1, 6 -> 6, 9 -> 9, 2 -> 2, 7 -> 7, 3 -> 3, 8 -> 8, 4 -> 4)
Map(0 -> 0, 5 -> 5, 1 -> 1, 6 -> 6, 9 -> 9, 2 -> 2, 7 -> 7, 3 -> 3, 8 -> 8, 4 -> 4)
Map(0 -> 0, 5 -> 5, 1 -> 1, 6 -> 6, 2 -> 2, 7 -> 7, 3 -> 3, 8 -> 8, 4 -> 4)
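One caveat with the Kryo approach, as far as I understand it: the encoder serializes the whole Record into a single binary column, so the map fields are not visible to Spark SQL and can only be reached after the row is deserialized back into a Record. The sketch below is a minimal way to check that for yourself. It assumes Spark 2.x or later in local mode; the names KryoSchemaCheck and recordEncoder are made up for illustration and are not part of the code above.

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

// Same shape as the Record used in the answer above.
case class Record(map: Map[Int, java.io.Serializable])

// Hypothetical object name, used only for this illustration.
object KryoSchemaCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").getOrCreate()

    // Kryo-backed encoder: each Record is serialized as one opaque blob.
    implicit val recordEncoder: Encoder[Record] = Encoders.kryo[Record]

    // Values of different types are fine as long as they are Serializable.
    val ds: Dataset[Record] = spark.createDataset(
      Seq(Record(Map(0 -> Int.box(0), 1 -> "one")))
    )

    // The schema should show one binary column, not a typed "map" column.
    ds.printSchema()

    // The fields are available again once rows are deserialized.
    ds.collect().foreach(r => println(r.map))

    spark.stop()
  }
}
```

If you need to filter or select on individual keys with Spark SQL, this representation will not help; it is a trade-off for being able to store arbitrary Serializable values.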