Creating datasets based on different case classes

Date: 2018-01-19 10:33:48

Tags: scala apache-spark pattern-matching case-class

Hi, I have an RDD that is essentially created by reading a CSV file. I have defined a method that maps the rows of the RDD to different case classes based on an input parameter.

The returned RDD needs to be converted to a DataFrame. When I try to run this, I get the error below.

The method is defined as:

  case class Australiafile1(sectionName: String, profitCentre: String, valueAgainst: String, Status: String)

  case class Australiafile2(sectionName: String, profitCentre: String)

  case class defaultclass(error: String)

  def mapper(line: String, recordLayoutClassToBeUsed: String) = {

    val fields = line.split(",")
    var outclass = recordLayoutClassToBeUsed match {
      case ("Australiafile1") => Australiafile1(fields(0), fields(1), fields(2), fields(3))
      case ("Australiafile2") => Australiafile2(fields(0), fields(1))
    }
    outclass

  }
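For example (with made-up sample values, not from the actual input file), the method dispatches purely on the layout name:

    mapper("Sales,PC100,500,Active", "Australiafile1") // => Australiafile1("Sales", "PC100", "500", "Active")
    mapper("Sales,PC100", "Australiafile2")            // => Australiafile2("Sales", "PC100")
    // Any other layout name falls through the match and throws a scala.MatchError.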

The output of this method is used to create a DataFrame, as shown below:

      val inputlines = spark.sparkContext.textFile(inputFile).cache().mapPartitionsWithIndex { (idx, lines) => if (idx == 0) lines.drop(numberOfLinesToBeRemoved.toInt) else lines }.cache()
      val records = inputlines.filter(x => !x.isEmpty).filter(x => x.split(",").length > 0).map(lines => mapper(lines, recordLayoutClassToBeUsed))

      import spark.implicits._

      val recordsDS = records.toDF()
      recordsDS.createTempView("recordtable")
      val output = spark.sql("select * from recordtable").toDF()
      output.write.option("delimiter", "|").option("header", "false").mode("overwrite").csv(outputFile)

The error received is as follows:

    Exception in thread "main" java.lang.NoClassDefFoundError: no Java class corresponding to Product with Serializable found
        at scala.reflect.runtime.JavaMirrors$JavaMirror.typeToJavaClass(JavaMirrors.scala:1300)
        at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:192)
        at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:54)
        at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:60)
        at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
        at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
        at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)

Could you please advise what is wrong here and how I can overcome it?

1 Answer:

Answer 0 (score: 1)

Try:

trait AustraliaFile extends Serializable

case class Australiafile1(sectionName: String, profitCentre: String, valueAgainst: String, Status: String) extends AustraliaFile

case class Australiafile2(sectionName: String, profitCentre: String) extends AustraliaFile

Your classes are not Serializable, but Spark can only write serializable objects. It is also a good idea to base the related classes on a common ancestor, so that you can declare your RDD as RDD[AustraliaFile] instead of RDD[Any].
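To see why the original code fails, look at the type the compiler infers for the match expression when the two case classes share no common ancestor. A minimal sketch, assuming the case classes from the question (the field values are placeholders):

    import scala.util.Random

    // Without a shared trait, the least upper bound of the two case classes is the
    // synthetic type Product with Serializable, for which Spark cannot resolve an
    // Encoder -- hence the NoClassDefFoundError when toDF() is called.
    val row = if (Random.nextBoolean()) Australiafile1("a", "b", "c", "d")
              else Australiafile2("a", "b")
    // row: Product with Serializable

    // With both classes extending AustraliaFile, the inferred type is instead a
    // subtype of AustraliaFile, so the mapped RDD can be typed as RDD[AustraliaFile].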

Also, your class-matching logic can be simplified to:

def mapper(line: String, recordLayoutClassToBeUsed: String) = {
  val fields = line.split(",")
  recordLayoutClassToBeUsed match {
    case "Australiafile1" => Australiafile1(fields(0), fields(1), fields(2), fields(3))
    case "Australiafile2" => Australiafile2(fields(0), fields(1))
  }
}
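As a further hardening step (my own suggestion, not part of the answer above), annotating the return type with the common trait and adding a catch-all case turns an unrecognised layout name into a descriptive failure rather than a bare scala.MatchError:

    def mapper(line: String, recordLayoutClassToBeUsed: String): AustraliaFile = {
      val fields = line.split(",")
      recordLayoutClassToBeUsed match {
        case "Australiafile1" => Australiafile1(fields(0), fields(1), fields(2), fields(3))
        case "Australiafile2" => Australiafile2(fields(0), fields(1))
        // Hypothetical fallback: fail fast with a clear message.
        case other => throw new IllegalArgumentException(s"Unknown record layout: $other")
      }
    }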