Question

I am trying to extract some common code into an abstract class, but I am running into problems.

Let's say I am reading a file formatted as "id|name":

case class Person(id: Int, name: String) extends Serializable

object Persons {
  def apply(lines: Dataset[String]): Dataset[Person] = {
    import lines.sparkSession.implicits._
    lines.map(line => {
      val fields = line.split("\\|")
      Person(fields(0).toInt, fields(1))
    })
  }
}

Persons(spark.read.textFile("persons.txt")).show()

Great. This works fine. Now let's say I want to read many different files that all have a "name" field, so I extract the common logic:

trait Named extends Serializable { val name: String }

abstract class NamedDataset[T <: Named] {
  def createRecord(fields: Array[String]): T
  def apply(lines: Dataset[String]): Dataset[T] = {
    import lines.sparkSession.implicits._
    lines.map(line => createRecord(line.split("\\|")))
  }
}

case class Person(id: Int, name: String) extends Named

object Persons extends NamedDataset[Person] {
  override def createRecord(fields: Array[String]) =
    Person(fields(0).toInt, fields(1))
}

This fails with two errors:

Error:
Unable to find encoder for type stored in a Dataset.  
Primitive types (Int, String, etc) and Product types (case classes) 
are supported by importing spark.implicits._  Support for serializing 
other types will be added in future releases.
lines.map(line => createRecord(line.split("\\|")))

Error:
not enough arguments for method map: 
(implicit evidence$7: org.apache.spark.sql.Encoder[T])org.apache.spark.sql.Dataset[T].
Unspecified value parameter evidence$7.
lines.map(line => createRecord(line.split("\\|")))

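I have a feeling this has something to do with implicits, TypeTags, and/or ClassTags, but I am just starting out with Scala and do not fully understand these concepts yet.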

Answer 1

You have to make two small changes:

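Since only primitives and Product types are supported (as the error message states), making the Named trait Serializable is not enough. It should extend Product instead (case classes already implement Product, so they can extend such a trait).
Indeed, Spark needs a ClassTag and a TypeTag to overcome type erasure and figure out the actual types.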

So, here is a working version:

import org.apache.spark.sql.Dataset
import scala.reflect.ClassTag
import scala.reflect.runtime.universe.TypeTag

trait Named extends Product { val name: String }

abstract class NamedDataset[T <: Named : ClassTag : TypeTag] extends Serializable {
  def createRecord(fields: Array[String]): T
  def apply(lines: Dataset[String]): Dataset[T] = {
    import lines.sparkSession.implicits._
    lines.map(line => createRecord(line.split("\\|")))
  }
}

case class Person(id: Int, name: String) extends Named

object Persons extends NamedDataset[Person] {
  override def createRecord(fields: Array[String]) =
    Person(fields(0).toInt, fields(1))
}
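
To see why the context bounds in the version above matter, here is a minimal plain-Scala sketch with no Spark dependency. It shows how a context bound on an abstract class captures type information at the point where the concrete subclass is defined, so that generic code in the base class can still use it at runtime. `Reader`, `readAll`, `recordClass`, and `PersonReader` are hypothetical names made up for this illustration and are not part of Spark. The sketch uses only `ClassTag`; the assumption is that the `TypeTag` bound reaches the implicit encoder lookup inside `apply` through the same mechanism.

```scala
import scala.reflect.ClassTag

// Hypothetical stand-in for NamedDataset; plain Scala collections, no Spark.
// The context bound [T: ClassTag] adds an implicit constructor parameter,
// which the compiler fills in where a concrete subclass fixes T.
abstract class Reader[T: ClassTag] {
  def createRecord(fields: Array[String]): T

  // T itself is erased at runtime, but the captured ClassTag still knows it.
  def recordClass: Class[_] = implicitly[ClassTag[T]].runtimeClass

  // toArray requires an implicit ClassTag[T]; without the context bound
  // this line would not compile, much like lines.map without an Encoder[T].
  def readAll(lines: Seq[String]): Array[T] =
    lines.map(line => createRecord(line.split("\\|"))).toArray
}

final case class Person(id: Int, name: String)

// T is fixed to Person here, so the compiler can supply ClassTag[Person].
object PersonReader extends Reader[Person] {
  override def createRecord(fields: Array[String]): Person =
    Person(fields(0).toInt, fields(1))
}
```

Calling `PersonReader.readAll(Seq("1|Ann", "2|Bob"))` builds an `Array[Person]` even though `readAll` is written against the erased type `T`. Dropping the `: ClassTag` bound from `Reader` produces a "No ClassTag available for T" compile error, which plays the same role as the missing-encoder error in the question.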
