Question

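I need to work with Datasets in Spark that hold entities with some known attributes but also a list of attributes that are unknown at compile time. I need a simple way to carry these optional attributes through the computation pipeline without touching them.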

Example code:

> loaded.printSchema
root
 |-- key: long (nullable = true)
 |-- value: string (nullable = true)
 |-- opt1: string (nullable = true)
 |-- opt2: long (nullable = true)

Suppose key and value are the attributes I know and care about at compile time.

case class BusinessEntity(key: Long, value: String) {
  def businessLogic = this
}

If I convert the DataFrame to a typed Dataset, the extra attributes are obviously lost.

loaded.as[BusinessEntity].map(_.businessLogic).toDF.printSchema
root
 |-- key: long (nullable = true)
 |-- value: string (nullable = true)

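What I need is to store them somewhere in the entity so that, at the very end of the computation pipeline (which may include joins, etc.), I can extract them into the target storage.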

I can imagine storing the optional data with one of the following approaches:

case class BusinessEntity(key: Long, value: String, extra: Row)
dataset.select("key", "value", "extra.*")

case class BusinessEntity(key: Long, value: String, extra: Map[String, AnyVal])
dataset.select($"key", $"value",
  /* Generate at runtime from attr list */
  $"extra"("opt1").cast("string").as("opt1"),
  $"extra"("opt2").cast("long").as("opt2"))

case class BusinessEntity(key: Long, value: String, extra: List[AnyVal])
dataset.select($"key", $"value",
  /* Generate at runtime from attr list */
  $"extra"(0).cast("string").as("opt1"),
  $"extra"(1).cast("long").as("opt2"))

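But none of them works, because Spark cannot generate an encoder for Row / Map[?, AnyVal] / List[AnyVal]. So far the only options I have found are storing the optional attributes as a JSON-encoded string, which I see as a last resort, or using Encoders.kryo to produce an encoder for a Map of AnyVal. Am I missing something? Is there an easier way to solve this?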
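For reference, the JSON fallback mentioned above could look roughly like the sketch below. It is only an illustration, not a recommended solution: the names EntityWithExtra, JsonExtrasSketch, pack and unpack are made up for this example, and it assumes that the DataFrame has at least one column besides key and value and that the schema of the extra columns can be kept around until the end of the pipeline. It relies on to_json and from_json, which are available from Spark 2.1.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json, struct, to_json}
import org.apache.spark.sql.types.StructType

// Hypothetical entity: known attributes plus one opaque JSON string.
case class EntityWithExtra(key: Long, value: String, extra: String) {
  def businessLogic: EntityWithExtra = this
}

object JsonExtrasSketch {
  // Columns known at compile time; everything else is treated as "extra".
  val known: Seq[String] = Seq("key", "value")

  // Packs all unknown columns into a single JSON string column named "extra".
  // Returns the packed DataFrame and the schema needed to unpack it later.
  def pack(df: DataFrame): (DataFrame, StructType) = {
    val extraFields = df.schema.fields.filterNot(f => known.contains(f.name))
    // Mark every extra field nullable, since to_json omits null values.
    val extraSchema = StructType(extraFields.map(_.copy(nullable = true)))
    val packed = df.select(
      col("key"),
      col("value"),
      to_json(struct(extraFields.map(f => col(f.name)): _*)).as("extra"))
    (packed, extraSchema)
  }

  // Parses the JSON string back into typed columns using the saved schema.
  def unpack(df: DataFrame, extraSchema: StructType): DataFrame =
    df.withColumn("extra", from_json(col("extra"), extraSchema))
      .select("key", "value", "extra.*")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("json-extras-sketch").getOrCreate()
    import spark.implicits._

    // Stand-in for the `loaded` DataFrame from the question.
    val loaded = Seq[(Long, String, String, Long)]((1L, "a", "x", 10L), (2L, "b", null, 20L))
      .toDF("key", "value", "opt1", "opt2")

    val (packed, extraSchema) = pack(loaded)
    // The typed stage only ever sees key, value and the opaque JSON string.
    val processed = packed.as[EntityWithExtra].map(_.businessLogic).toDF()
    unpack(processed, extraSchema).printSchema()

    spark.stop()
  }
}
```

The typed part of the pipeline never needs to know the names or types of the optional attributes; the price is a serialization round trip through JSON, which is why it reads as a last resort.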

Answer 1

I would just define the optional values in my case class as Option, with None as their default value:

case class BusinessEntity(key: Long, value: String, opt1: Option[String] = None, opt2: Option[Long] = None)
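A minimal sketch of how this might be used end to end is shown below. The object name OptionalFieldsSketch, the helper roundTrip, and the sample rows are assumptions made for illustration. Two caveats apply as far as I know: the optional column names still have to be known at compile time, and the None defaults only help when entities are constructed in code, because as[BusinessEntity] still expects opt1 and opt2 to exist as columns in the input (a null value in a column is read as None).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Optional attributes are ordinary fields of type Option, defaulting to None.
case class BusinessEntity(
    key: Long,
    value: String,
    opt1: Option[String] = None,
    opt2: Option[Long] = None) {
  def businessLogic: BusinessEntity = this
}

object OptionalFieldsSketch {
  // Round-trips a DataFrame through the typed API. opt1 and opt2 survive
  // because they are part of the case class the encoder is derived from.
  def roundTrip(loaded: DataFrame): DataFrame = {
    val spark = loaded.sparkSession
    import spark.implicits._
    loaded.as[BusinessEntity].map(_.businessLogic).toDF()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("optional-fields-sketch").getOrCreate()
    import spark.implicits._

    // Stand-in for the `loaded` DataFrame from the question; the null in the
    // second row becomes opt1 = None in the typed Dataset.
    val loaded = Seq[(Long, String, String, Long)]((1L, "a", "x", 10L), (2L, "b", null, 20L))
      .toDF("key", "value", "opt1", "opt2")

    roundTrip(loaded).printSchema()
    spark.stop()
  }
}
```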

How to create a typed Dataset that can have optional attributes in Spark 2.1
