Using a case class to add unknown columns as null

Asked: 2019-04-17 13:09:28

Tags: scala apache-spark

I'm creating a new DataFrame (shaped by a case class) from an input DataFrame that may have fewer or different columns. I'm trying to use the case class to set any missing values to null.

I'm using this case class to drive the new DataFrame that gets created.

The input DataFrame (incomingDf) may not contain all of the variable fields that default to null in the case class below.

import java.sql.Date  // Spark's product encoder maps java.sql.Date to the DATE column type

case class existingSchema(source_key: Int
                        , sequence_number: Int
                        , subscriber_id: String
                        , subscriber_ssn: String
                        , last_name: String
                        , first_name: String
                        , variable1: String = null
                        , variable2: String = null
                        , variable3: String = null
                        , variable4: String = null
                        , variable5: String = null
                        , source_date: Date
                        , load_date: Date
                        , file_name_String: String)

val incomingDf = spark.table("raw.incoming")

val formattedDf = incomingDf.as[existingSchema].toDF()

This throws an error at compile time.

formattedDf should end up with the same schema as the case class existingSchema.

incomingDf.printSchema
root
 |-- source_key: integer (nullable = true)
 |-- sequence_number: integer (nullable = true)
 |-- subscriber_id: string (nullable = true)
 |-- subscriber_ssn: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- variable1: string (nullable = true)
 |-- variable3: string (nullable = true)
 |-- source_date: date (nullable = true)
 |-- load_date: date (nullable = true)
 |-- file_name_string: string (nullable = true)

Compile error:

Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases.
    val formattedDf = incomingDf.as[existingSchema].toDF()
                                                     ^
one error found
 FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':compileScala'.
> Compilation failed

Update: I added the following line:

import incomingDf.sparkSession.implicits._

and it now compiles fine.

I now get the following error at runtime:

19/04/17 14:37:56 ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve '`variable2`' given input columns: [variable1, variable3, sequence_number, last_name, first_name, file_name_string, subscriber_id, load_date, source_key];
org.apache.spark.sql.AnalysisException: cannot resolve '`variable2`' given input columns: [variable1, variable3, sequence_number, last_name, first_name, file_name_string, subscriber_id, load_date, source_key];
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)

2 Answers:

Answer 0 (score: 3)

You probably want to define the DataFrame schema explicitly. For example:

import org.apache.spark.sql.types._

val newSchema: StructType = StructType(Array(
  StructField("nested_array", ArrayType(ArrayType(StringType)), true),
  StructField("numbers", IntegerType, true),
  StructField("text", StringType, true)
))

// Given a DataFrame df...
val combinedSchema = StructType(df.schema ++ newSchema)
val resultRDD = ... // here, process df to add rows or whatever and get the result as an RDD
                    // you can get an RDD as simply as df.rdd
val outDf = sparkSession.createDataFrame(resultRDD, combinedSchema)

The third argument to StructField ensures that the newly created field is nullable. It defaults to true, so you don't strictly have to pass it, but I've included it for clarity, since the whole point of this approach is to create explicitly nullable fields.
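
Purely as an illustration of the elided step above, here is one way resultRDD could be produced, assuming the new columns should simply hold null for every existing row (df, sparkSession, newSchema, and combinedSchema are the names from the snippet above, and the padding order must match newSchema):

import org.apache.spark.sql.Row

// Hypothetical sketch: pad each existing row with one null per added field
// (three here, matching newSchema), then rebuild the DataFrame with the
// combined schema so the new fields come through as nullable null columns.
val resultRDD = df.rdd.map(row => Row.fromSeq(row.toSeq ++ Seq(null, null, null)))
val outDf = sparkSession.createDataFrame(resultRDD, combinedSchema)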

Answer 1 (score: 1)

The existing schema is missing some of the case class's String fields. You just need to add them explicitly:

import org.apache.spark.sql.functions.lit

// Add each missing column as a null String literal before applying the case class.
val formattedDf = Seq("variable2", "variable4", "variable5")
  .foldLeft(incomingDf)((df, col) => {
    df.withColumn(col, lit(null.asInstanceOf[String]))
  }).as[existingSchema].toDF()

A more generic solution would be to infer the missing fields, for example along the lines of the sketch below.
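
A minimal sketch of that idea, assuming the target schema is derived from the case class with Encoders.product, that the spark.implicits._ import from the question's update is in scope, and that every missing field has a nullable type (as the variableN: String fields do):

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.lit

// Schema implied by the case class.
val targetSchema = Encoders.product[existingSchema].schema

// Columns already present, compared case-insensitively to match Spark's default resolution.
val present = incomingDf.columns.map(_.toLowerCase).toSet

// Add every missing field as a typed null column, then apply the case class.
val completedDf = targetSchema.fields
  .filterNot(f => present.contains(f.name.toLowerCase))
  .foldLeft(incomingDf)((df, f) => df.withColumn(f.name, lit(null).cast(f.dataType)))
  .as[existingSchema]
  .toDF()

This way the list of missing column names never has to be maintained by hand; any new nullable field added to the case class is picked up automatically.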