Hi, I have an RDD which is basically made after reading a CSV file. I have defined a method which maps the rows of the RDD to different case classes based on an input parameter. The returned RDD needs to be converted to a DataFrame; when I try to run the same, I get the error below.
The method defined is:
case class Australiafile1(sectionName: String, profitCentre: String, valueAgainst: String, Status: String)
case class Australiafile2(sectionName: String, profitCentre: String)
case class defaultclass(error: String)
def mapper(line: String, recordLayoutClassToBeUsed: String) = {
  val fields = line.split(",")
  var outclass = recordLayoutClassToBeUsed match {
    case ("Australiafile1") => Australiafile1(fields(0), fields(1), fields(2), fields(3))
    case ("Australiafile2") => Australiafile2(fields(0), fields(1))
  }
  outclass
}
The output of the method is used to create a DataFrame as shown below:
val inputlines = spark.sparkContext.textFile(inputFile).cache()
  .mapPartitionsWithIndex { (idx, lines) =>
    if (idx == 0) lines.drop(numberOfLinesToBeRemoved.toInt) else lines
  }
  .cache()
val records = inputlines
  .filter(x => !x.isEmpty)
  .filter(x => x.split(",").length > 0)
  .map(lines => mapper(lines, recordLayoutClassToBeUsed))
import spark.implicits._
val recordsDS = records.toDF()
recordsDS.createTempView("recordtable")
val output = spark.sql("select * from recordtable").toDF()
output.write.option("delimiter", "|").option("header", "false").mode("overwrite").csv(outputFile)
The error received is as follows:
Exception in thread "main" java.lang.NoClassDefFoundError: no Java class corresponding to Product with Serializable found
    at scala.reflect.runtime.JavaMirrors$JavaMirror.typeToJavaClass(JavaMirrors.scala:1300)
    at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:192)
    at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:54)
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:60)
    at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
    at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
    at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
Could you please advise what is wrong with this, and how I can overcome it?
Answer 0 (score: 1)
Try:
trait AustraliaFile extends Serializable
case class Australiafile1(sectionName: String, profitCentre: String, valueAgainst: String, Status: String) extends AustraliaFile
case class Australiafile2(sectionName: String, profitCentre: String) extends AustraliaFile
Your classes are not Serializable, but Spark can only write serializable objects. It is also a good idea to base related classes on a common ancestor, so that you can declare your RDD as RDD[AustraliaFile] rather than RDD[Any].
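For illustration, a minimal sketch of the typing point (reusing inputlines, mapper and recordLayoutClassToBeUsed from the question; the explicit type ascription is the only addition):

import org.apache.spark.rdd.RDD

// With both case classes extending AustraliaFile, the mapped RDD can be
// declared against the common trait. Without it, the compiler infers the
// refinement type Product with Serializable, which has no Java class;
// that is exactly what the NoClassDefFoundError above complains about.
val records: RDD[AustraliaFile] = inputlines
  .filter(x => !x.isEmpty)
  .map(line => mapper(line, recordLayoutClassToBeUsed))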
Also, your class matching logic can be simplified to:
def mapper(line: String, recordLayoutClassToBeUsed: String) = {
  val fields = line.split(",")
  recordLayoutClassToBeUsed match {
    case ("Australiafile1") => Australiafile1(fields(0), fields(1), fields(2), fields(3))
    case ("Australiafile2") => Australiafile2(fields(0), fields(1))
  }
}
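One caveat not covered in the answer: the match above is partial, so an unrecognized layout name fails with a scala.MatchError at runtime. A sketch of one way to handle that, assuming the otherwise-unused defaultclass from the question is also made to extend AustraliaFile and serve as an error record:

case class defaultclass(error: String) extends AustraliaFile // assumption: reused as an error record

def mapper(line: String, recordLayoutClassToBeUsed: String): AustraliaFile = {
  val fields = line.split(",")
  recordLayoutClassToBeUsed match {
    case "Australiafile1" => Australiafile1(fields(0), fields(1), fields(2), fields(3))
    case "Australiafile2" => Australiafile2(fields(0), fields(1))
    // Catch-all so a bad parameter yields an error record instead of a MatchError
    case other            => defaultclass(s"unknown record layout: $other")
  }
}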