I am creating a new DataFrame (shaped by a case class) from an input DataFrame that may have fewer or different columns than the target. I am trying to use the case class to set the missing columns to null.
I am using the following case class to drive the new DataFrame to be created:
The input DataFrame (incomingDf) may not have all of the variable fields that the case class defaults to null.
case class existingSchema(source_key: Int
, sequence_number: Int
, subscriber_id: String
, subscriber_ssn: String
, last_name: String
, first_name: String
, variable1: String = null
, variable2: String = null
, variable3: String = null
, variable4: String = null
, variable5: String = null
, source_date: Date
, load_date: Date
, file_name_String: String)
val incomingDf = spark.table("raw.incoming")
val formattedDf = incomingDf.as[existingSchema].toDF()
This throws an error at compile time.
The new schema of formattedDf should match the case class existingSchema.
incomingDf.printSchema
root
|-- source_key: integer (nullable = true)
|-- sequence_number: integer (nullable = true)
|-- subscriber_id: string (nullable = true)
|-- subscriber_ssn: string (nullable = true)
|-- last_name: string (nullable = true)
|-- first_name: string (nullable = true)
|-- variable1: string (nullable = true)
|-- variable3: string (nullable = true)
|-- source_date: date (nullable = true)
|-- load_date: date (nullable = true)
|-- file_name_string: string (nullable = true)
Compilation error:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val formattedDf = incomingDf.as[existingSchema].toDF()
^
one error found
FAILED
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':compileScala'.
> Compilation failed
Update: I added the line:
import incomingDf.sparkSession.implicits._
and it now compiles fine.
However, I now get the following error at runtime:
19/04/17 14:37:56 ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve '`variable2`' given input columns: [variable1, variable3, sequence_number, last_name, first_name, file_name_string, subscriber_id, load_date, source_key];
org.apache.spark.sql.AnalysisException: cannot resolve '`variable2`' given input columns: [variable1, variable3, sequence_number, last_name, first_name, file_name_string, subscriber_id, load_date, source_key];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
Answer 0 (score: 3)
You probably want to define the DataFrame schema explicitly. For example:
import org.apache.spark.sql.types._
val newSchema: StructType = StructType(Array(
  StructField("nested_array", ArrayType(ArrayType(StringType)), true),
  StructField("numbers", IntegerType, true),
  StructField("text", StringType, true)
))
// Given a DataFrame df...
val combinedSchema = StructType(df.schema ++ newSchema)
val resultRDD = ... // here, process df to add rows or whatever and get the result as an RDD
// you can get an RDD as simply as df.rdd
val outDf = sparkSession.createDataFrame(resultRDD, combinedSchema)
The third argument to StructField ensures that the newly created field is nullable. It defaults to true, so you don't strictly have to pass it, but I include it here for clarity, since the whole point of using this approach is to create explicitly nullable fields.
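Applied to the question, a minimal sketch of that approach might look like the following (assuming the columns missing from incomingDf are variable2, variable4 and variable5, all nullable strings; the names missingFields, paddedRdd and outDf are just illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical list of the columns the case class expects but incomingDf lacks.
val missingFields = Seq("variable2", "variable4", "variable5")
  .map(name => StructField(name, StringType, true))

// Target schema = existing columns followed by the missing nullable columns.
val combinedSchema = StructType(incomingDf.schema ++ missingFields)

// Pad every existing row with one null per missing field.
val paddedRdd = incomingDf.rdd.map { row =>
  Row.fromSeq(row.toSeq ++ Seq.fill(missingFields.size)(null))
}

val outDf = spark.createDataFrame(paddedRdd, combinedSchema)

Note that the resulting column order is incomingDf's columns followed by the appended ones; if you need the exact field order of the case class, add a select over the case class field names afterwards.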
Answer 1 (score: 1)
The existing schema is missing some of the case class's String fields. You just need to add them explicitly:
import org.apache.spark.sql.functions.lit

val formattedDf = Seq("variable2", "variable4", "variable5")
  .foldLeft(incomingDf)((df, col) => {
    df.withColumn(col, lit(null.asInstanceOf[String]))
  }).as[existingSchema].toDF()
A more general solution would be to infer the missing fields.
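One way to do that inference, as a rough sketch (assuming every field declared in the case class but absent from the input should simply become a null column of the declared type; Encoders.product is used here to derive the case class schema):

import org.apache.spark.sql.{DataFrame, Encoders}
import org.apache.spark.sql.functions.lit

// Derive the target schema from the case class itself.
val targetSchema = Encoders.product[existingSchema].schema

// Add every field the case class declares but the input lacks as a null
// column of the declared type, then line the frame up with the case class.
val inferredDf: DataFrame = targetSchema.fields
  .filterNot(f => incomingDf.columns.contains(f.name))
  .foldLeft(incomingDf) { (df, field) =>
    df.withColumn(field.name, lit(null).cast(field.dataType))
  }

val formattedDf = inferredDf.as[existingSchema].toDF()

Keep in mind that the contains check above is case-sensitive, so a mismatch such as file_name_String in the case class versus file_name_string in the input would be treated as a missing column; normalize the names first if that matters.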