Spark DataFrame to sealed trait type

Asked: 2018-06-19 10:48:06

Tags: scala apache-spark apache-spark-sql

I have some data stored as Parquet files, together with case classes matching the data schema. Spark handles regular product types well, so if I have

case class A(s: String, i: Int)

I can easily do

spark.read.parquet(file).as[A]

But as far as I know, Spark does not handle disjunction (sum) types. So when my Parquet contains an enum, previously encoded as an integer, with a Scala representation like

sealed trait E
case object A extends E
case object B extends E

I cannot do

spark.read.parquet(file).as[E]
// java.lang.UnsupportedOperationException: No Encoder found for E

That makes sense so far. But then, perhaps too naively, I try

import scala.reflect.ClassTag
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

implicit val eEncoder = new org.apache.spark.sql.Encoder[E] {
  def clsTag = ClassTag(classOf[E])
  def schema = StructType(StructField("e", IntegerType, nullable = false) :: Nil)
}

and I still get "No Encoder found for E" :(

My question now is: why is the implicit missing from scope (or not recognized as an Encoder[E])? And even if it were picked up, how would such an interface let me actually decode the data? I would still need to map the stored values to the correct case objects.

I have read a related answer that says "TL;DR There is no good solution right now, and given Spark SQL / Dataset implementation, it is unlikely there will be one in the foreseeable future." But I'm struggling to understand why a custom Encoder couldn't do the trick.

1 Answer:

Answer 0 (score: 2):

But I'm struggling to understand why a custom Encoder couldn't do the trick.

Two main reasons:

  • There is no API for custom Encoders. The only publicly available ones are the "binary" Kryo and Java Encoders, which create blobs that are useless for DataFrame / Dataset[Row] work, with no support for any meaningful SQL / DataFrame operations.

    Code like this would work fine

    import org.apache.spark.sql.Encoders

    // Compiles and runs, but each value is stored as an opaque Kryo-serialized
    // blob: the resulting schema is a single binary column named `value`.
    spark.createDataset(Seq(A, B): Seq[E])(Encoders.kryo[E])
    

    but it is nothing more than a curiosity.

  • DataFrame is a columnar store. It is technically possible to encode type hierarchies on top of this structure (the private UserDefinedType API does that), but it is cumbersome (you have to provide storage for all possible variants; see for example How to define schema for custom type in Spark SQL?) and inefficient (in general, complex types are somewhat second-class citizens in Spark SQL, and many optimizations are not accessible with complex schemas, though this is subject to future changes).

    In a broader sense, the DataFrame API is effectively relational (as in relational algebra), and tuples (the main building block of relations) are by definition homogeneous, so by extension there is no place in the SQL / DataFrame API for heterogeneous structures. A common workaround is sketched below.
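For illustration only, here is a minimal sketch of that workaround (not from the original answer): keep the column as an Int inside Spark SQL, where it remains a plain relational value, and map it to the sealed trait only at the edge, accepting the opaque Kryo representation for the heterogeneous result. The file path and the 0 -> A / 1 -> B mapping are hypothetical assumptions about how the data was written.

import org.apache.spark.sql.{Encoders, SparkSession}

sealed trait E
case object A extends E
case object B extends E

// Mirrors the on-disk schema: the enum column stays a plain Int here.
case class Raw(s: String, e: Int)

// Hypothetical decoding of the integer encoding used when writing.
def decode(i: Int): E = i match {
  case 0 => A
  case 1 => B
}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val es = spark.read.parquet("file")        // placeholder path
  .as[Raw]                                 // product type: fully supported
  .map(r => decode(r.e))(Encoders.kryo[E]) // sum type: opaque binary blob

Any filtering or grouping on the enum should happen while it is still the Int column; after the map, the Dataset carries only the Kryo blob, which is exactly the limitation described above.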