Let's create a DataFrame and a Dataset with the same data:
val personDF = Seq(("Max", 33), ("Adam", 32), ("John", 62)).toDF("name", "age")
case class Person(name: String, age: Int)
val personDS = Seq(Person("Max", 33), Person("Adam", 32), Person("John", 62)).toDS()
personDF.select("name").explain // DataFrame
// == Physical Plan ==
// LocalTableScan [name#14]
personDS.map(_.name).explain // Dataset
// == Physical Plan ==
// *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#29]
// +- *MapElements <function1>, obj#28: java.lang.String
// +- *DeserializeToObject newInstance(class $line129.$read$$iwC$$iwC$Person), obj#27: $line129.$read$$iwC$$iwC$Person
// +- LocalTableScan [name#2, age#3]
The Dataset physical plan contains the extra DeserializeToObject, MapElements, and SerializeFromObject steps. What effect do these have on performance?
Edit:
Are there any experiments/benchmarks available that compare the two?
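For reference, here is a minimal timing sketch I would use to compare the two projections myself. It assumes a running `spark-shell` (so `spark` and `toDS`/`toDF` implicits are in scope); the `time` helper and the row count are my own assumptions, not Spark API, and wall-clock timing like this is only a rough indicator, not a rigorous benchmark:

```scala
import spark.implicits._

case class Person(name: String, age: Int)

// Generate a dataset large enough for the serialization overhead to show up.
// (Row count is an arbitrary assumption; adjust to your cluster.)
val n = 10000000L
val ds = spark.range(n).map(i => Person(s"name$i", (i % 100).toInt)).cache()
ds.count() // materialize the cache so generation cost is excluded from timing

// Hypothetical helper: crude wall-clock timing of a single action.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}

// DataFrame-style projection: stays in Tungsten's binary row format.
time("select(\"name\")") { ds.select("name").count() }

// Dataset-style projection: deserializes each row into a Person object,
// applies the lambda on the JVM object, then serializes the result back,
// matching the DeserializeToObject/MapElements/SerializeFromObject steps
// in the plan above.
time("map(_.name)") { ds.map(_.name).count() }
```

Running each variant several times and discarding the first (JIT warm-up) iteration would make the comparison somewhat fairer.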