How to mock a Spark Scala DataFrame with a nested case class schema?

Asked: 2018-09-18 19:10:24

Tags: scala apache-spark

How do I create/mock a Spark Scala DataFrame that has a case class nested inside its top-level schema?

root
 |-- _id: long (nullable = true)
 |-- continent: string (nullable = true)
 |-- animalCaseClass: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- gender: string (nullable = true)

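I am currently unit testing a function that outputs a DataFrame with the schema above. To check equality I built the expected DataFrame with toDF(), which unfortunately gives "_id" a schema with nullable = false in the mocked DataFrame, so the test fails (note that the function's "actual" output has nullable = true for every field).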
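The mismatch can be reproduced in isolation. The following is a minimal sketch, assuming a local SparkSession; the object name `ToDFNullability` and the two-column data are illustrative and not part of the original question. It shows that toDF() marks a primitive Long column as non-nullable while a String column stays nullable.

```scala
import org.apache.spark.sql.SparkSession

object ToDFNullability {
  // Returns the nullable flags of (_id, continent) for a DataFrame built with toDF().
  def nullability(spark: SparkSession): (Boolean, Boolean) = {
    import spark.implicits._
    val df = Seq((12345L, "Asia")).toDF("_id", "continent")
    (df.schema("_id").nullable, df.schema("continent").nullable)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is sufficient for this check.
    val spark = SparkSession.builder().master("local[2]").appName("ToDFNullability").getOrCreate()
    println(nullability(spark)) // prints (false,true)
    spark.stop()
  }
}
```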

I also tried creating the mocked DataFrame another way, which results in an error: https://pastebin.com/WtxtgMJA

Here is what I tried in that approach:

import org.apache.spark.sql.{Encoders, Row}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
val animalSchema = Encoders.product[AnimalCaseClass].schema

val schema = List(
  StructField("_id", LongType, true),
  StructField("continent", StringType, true),
  StructField("animalCaseClass", animalSchema, true)
)

val data = Seq(Row(12345L, "Asia", AnimalCaseClass("tiger", "male")), Row(12346L, "Asia", AnimalCaseClass("tigress", "female")))

val expected = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

I have to use this approach so that the fields toDF marks as nullable = false by default end up with nullable = true.

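How can I build a DataFrame that has the same schema as the output of the function under test, while still declaring values that are case class instances?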

1 answer:

Answer 0: (score: 0)

From the logs you provided, you can see that

Caused by: java.lang.RuntimeException: models.AnimalCaseClass is not a valid external type for schema of struct<name:String,gender:String,,... 3 more fields>

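This means you are trying to insert an object of type AnimalCaseClass into a column whose data type is struct<name:String,gender:String>. It happens because the data was built from Row objects: when a Row is paired with an explicit schema, a struct column must itself be supplied as a nested Row, not as a case class instance. If you build the data from tuples and call toDF() instead, Spark's encoders convert the case class into a struct, and createDataFrame can then re-apply the schema with the nullability you need.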
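If you would rather keep the Row-based construction from the question, one option is to pass the nested value as a Row as well. This is a minimal sketch and not part of the original answer; the object name `NestedRowMock` and the method `expected` are illustrative, and AnimalCaseClass is redefined locally here, whereas in the question it lives in the `models` package.

```scala
import org.apache.spark.sql.{DataFrame, Encoders, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Assumption: same shape as models.AnimalCaseClass from the question.
case class AnimalCaseClass(name: String, gender: String)

object NestedRowMock {
  def expected(spark: SparkSession): DataFrame = {
    // The case class is used only to derive the struct schema.
    val animalSchema = Encoders.product[AnimalCaseClass].schema

    val schema = StructType(Seq(
      StructField("_id", LongType, nullable = true),
      StructField("continent", StringType, nullable = true),
      StructField("animalCaseClass", animalSchema, nullable = true)
    ))

    // The struct column is given as a nested Row, not as a case class instance.
    val data = Seq(
      Row(12345L, "Asia", Row("tiger", "male")),
      Row(12346L, "Asia", Row("tigress", "female"))
    )

    spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
  }
}
```

The trade-off is that the values are no longer declared as case class instances. The answer's own code, which follows, keeps the case class values by going through toDF() first.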

import org.apache.spark.SparkConf
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import org.apache.spark.sql.SparkSession

case class AnimalCaseClass(name: String, gender: String)

object Test extends App {

  val conf: SparkConf = new SparkConf()
  conf.setAppName("Test")
  conf.setMaster("local[2]")
  conf.set("spark.sql.test", "")
  conf.set(SQLConf.CODEGEN_FALLBACK.key, "false")

  val spark: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()

  // ** The relevant part **
  import org.apache.spark.sql.Encoders
  val animalSchema = Encoders.product[AnimalCaseClass].schema

  val expectedSchema: StructType = StructType(Seq(
    StructField("_id", LongType, true),
    StructField("continent", StringType, true),
    StructField("animalCaseClass", animalSchema, true)
  ))

  import spark.implicits._
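  // Build the data from tuples so the implicit encoders turn the case class into a struct.
  val data = Seq(
    (12345L, "Asia", AnimalCaseClass("tiger", "male")),
    (12346L, "Asia", AnimalCaseClass("tigress", "female"))
  ).toDF()

  // Re-apply the expected schema to the underlying rows: this renames _1, _2, _3
  // and makes every field nullable.
  val expected = spark.createDataFrame(data.rdd, expectedSchema)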

  expected.printSchema()

  expected.show()

  spark.stop()
}
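The same idea can be generalized so the expected schema does not have to be written out by hand. The sketch below is an assumption-laden illustration, not part of the original answer: the object `NullableSchema` and its methods are hypothetical names. It copies a DataFrame's schema with every field marked nullable, recursing into nested structs only (array and map element types are left untouched), and re-applies it with createDataFrame.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DataType, StructType}

object NullableSchema {
  // Copy the schema, setting nullable = true on every field, including nested struct fields.
  def allNullable(schema: StructType): StructType =
    StructType(schema.fields.map { field =>
      val dataType: DataType = field.dataType match {
        case nested: StructType => allNullable(nested)
        case other              => other // arrays and maps are not rewritten in this sketch
      }
      field.copy(dataType = dataType, nullable = true)
    })

  // Rebuild the DataFrame from its rows with the all-nullable schema.
  def withAllNullable(spark: SparkSession, df: DataFrame): DataFrame =
    spark.createDataFrame(df.rdd, allNullable(df.schema))
}
```

In a test, this could be applied to a DataFrame built with `toDF("_id", "continent", "animalCaseClass")` before comparing its schema and collected rows with the function's actual output.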