Question

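Spark Datasets move away from Rows to Encoders for POJOs and primitives. The Catalyst engine uses an ExpressionEncoder to convert columns in a SQL expression. However, there do not appear to be any other subclasses of Encoder available to use as a template for our own implementations.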

Here is an example of code that worked with Spark 1.X / DataFrames and does not compile in the new regime:

//mapping each row to RDD tuple
df.map(row => {
    var id: String = if (!has_id) "" else row.getAs[String]("id")
    var label: String = row.getAs[String]("label")
    val channels  : Int = if (!has_channels) 0 else row.getAs[Int]("channels")
    val height  : Int = if (!has_height) 0 else row.getAs[Int]("height")
    val width : Int = if (!has_width) 0 else row.getAs[Int]("width")
    val data : Array[Byte] = row.getAs[Any]("data") match {
      case str: String => str.getBytes
      case arr: Array[Byte@unchecked] => arr
      case _ => {
        log.error("Unsupport value type")
        null
      }
    }
    (id, label, channels, height, width, data)
  }).persist(StorageLevel.DISK_ONLY)

}

We get this compiler error:

Error:(56, 11) Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are supported 
by importing spark.implicits._  Support for serializing other types will be added in future releases.
    df.map(row => {
          ^

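So somehow, somewhere, there should be a means to:

1. Define/implement our custom Encoder
2. Apply it when performing a mapping on the DataFrame (which is now a Dataset of type Row)
3. Register the Encoder for use by other custom code

I am looking for code that successfully performs these steps.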

Answer 1

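As far as I am aware, nothing has really changed since 1.6, and the solutions described in How to store custom objects in Dataset? are the only available options. Nevertheless, your current code should work just fine with the default encoders for product types.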
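For a class that is not a product type, one option is a Kryo-based encoder. The following is a minimal sketch of the three steps from the question (define, apply, register), not a definitive implementation. The Image class, the column names, and the local SparkSession settings are assumptions made for illustration. "Registering" here simply means placing an implicit Encoder in scope.

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

// Hypothetical non-Product class, used only for illustration.
class Image(val id: String, val data: Array[Byte]) extends Serializable

object KryoEncoderSketch {
  // Define and "register": any Dataset operation that needs an
  // Encoder[Image] picks up this implicit value when it is in scope.
  implicit val imageEncoder: Encoder[Image] = Encoders.kryo[Image]

  def main(args: Array[String]): Unit = {
    // Assumed local session, for the sketch only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("kryo-encoder-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample data with two string columns.
    val df = Seq(("a", "xyz"), ("b", "uvw")).toDF("id", "data")

    // Apply: map on a Dataset[Row] requires an Encoder[Image];
    // the implicit imageEncoder above supplies it.
    val images: Dataset[Image] = df.map { row =>
      new Image(row.getAs[String]("id"), row.getAs[String]("data").getBytes("UTF-8"))
    }

    images.collect().foreach(img => println(s"${img.id}: ${img.data.length} bytes"))
    spark.stop()
  }
}
```

A Kryo encoder stores each object as a single binary column, so the fields of Image are not visible to Spark SQL as separate columns. Where the data fits a case class or tuple, the default product encoders are usually the simpler choice.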

To get some insight into why your code worked in 1.x and may not work in 2.0.0, you have to check the signatures. In 1.x, DataFrame.map is a method that takes a function Row => T and transforms an RDD[Row] into an RDD[T].

In 2.0.0, DataFrame.map also takes a function of type Row => T, but it transforms a Dataset[Row] (a.k.a. DataFrame) into a Dataset[T], hence T requires an Encoder. If you want the "old" behavior, you should use the RDD explicitly:

df.rdd.map(row => ???)

For Dataset[Row] map, see Encoder error while trying to map dataframe row to updated row
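To make the difference between the two signatures concrete, here is a small sketch that maps the same DataFrame once through the RDD API and once through the Dataset API. The column names, sample data, and local SparkSession settings are assumptions for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

object MapSketch {
  def main(args: Array[String]): Unit = {
    // Assumed local session, for the sketch only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("map-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample data.
    val df = Seq(("a", "cat", 3), ("b", "dog", 1)).toDF("id", "label", "channels")

    // 1.x-style behavior: go through the RDD API, where no Encoder is involved.
    val asRdd: RDD[(String, String, Int)] = df.rdd.map { row =>
      (row.getAs[String]("id"), row.getAs[String]("label"), row.getAs[Int]("channels"))
    }

    // 2.x Dataset API: map needs an Encoder for the result type.
    // A tuple of supported types is a product type, so the encoder
    // comes from spark.implicits._ imported above.
    val asDs: Dataset[(String, String, Int)] = df.map { row =>
      (row.getAs[String]("id"), row.getAs[String]("label"), row.getAs[Int]("channels"))
    }

    println(asRdd.count())
    asDs.printSchema()
    spark.stop()
  }
}
```

The RDD route gives up the Dataset optimizations, so it is mainly useful when the result type has no encoder available.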

Answer 2

Did you import the implicit encoders?

import spark.implicits._

http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.sql.Encoder

Answer 3

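I imported spark.implicits._, where spark is the SparkSession. That solved the error and brought the encoders into scope.

Writing a custom encoder is another way out, but I have never tried it.

Working solution: create a SparkSession and import the following:

import spark.implicits._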
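As a rough sketch of that solution applied to the question's fields, the tuple can be replaced with a case class so that the implicit product encoder applies. The ImageRecord case class, the sample data, and the local SparkSession settings are assumptions, not part of the original code.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case class mirroring the fields of the question's tuple.
// It is declared at the top level so Spark can derive an encoder for it.
case class ImageRecord(
  id: String, label: String, channels: Int, height: Int, width: Int, data: Array[Byte])

object ImplicitsSketch {
  def main(args: Array[String]): Unit = {
    // Assumed local session, for the sketch only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("implicits-sketch")
      .getOrCreate()
    // The import must follow the creation of the SparkSession value,
    // because the implicits are members of that instance.
    import spark.implicits._

    // Assumed sample data; "data" is a string column here.
    val df = Seq(("a", "cat", 3, 2, 2, "abcd"))
      .toDF("id", "label", "channels", "height", "width", "data")

    // Case classes are product types, so the encoder is found implicitly.
    val records: Dataset[ImageRecord] = df.map { row =>
      ImageRecord(
        row.getAs[String]("id"),
        row.getAs[String]("label"),
        row.getAs[Int]("channels"),
        row.getAs[Int]("height"),
        row.getAs[Int]("width"),
        row.getAs[String]("data").getBytes("UTF-8"))
    }

    records.printSchema()
    spark.stop()
  }
}
```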

How to create a custom Encoder in Spark 2.X Datasets?
