I ran the following exercise in the Spark shell:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.
scala> case class Test(notNullable:String, nullable:Option[String])
defined class Test
scala> val myArray = Array(
| Test("x", None),
| Test("y", Some("z"))
| )
myArray: Array[Test] = Array(Test(x,None), Test(y,Some(z)))
scala> val rdd = sc.parallelize(myArray)
rdd: org.apache.spark.rdd.RDD[Test] = ParallelCollectionRDD[0] at parallelize at <console>:28
scala> rdd.toDF.printSchema
root
|-- notNullable: string (nullable = true)
|-- nullable: string (nullable = true)
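I have read (in Spark in Action) that, given a case class with Option fields, the fields that are not Options should be inferred as non-nullable. Is that true? If so, what am I doing wrong here?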
Answer 0 (score: 2)
There are two issues here:
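First, non-Option fields are inferred as non-nullable only for certain types (Int, Long, Short, Double, Float, Byte, Boolean), and String is evidently not one of them. You can see the behavior for Int, for example: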
case class Test(notNullable: String,
                nullable: Option[String],
                notNullInt: Int,
                nullableInt: Option[Int])
val myArray = Array(
  Test("x", None, 1, None),
  Test("y", Some("z"), 2, Some(3))
)
myArray.toSeq.toDF().printSchema
// root
// |-- notNullable: string (nullable = true)
// |-- nullable: string (nullable = true)
// |-- notNullInt: integer (nullable = false) // !!!
// |-- nullableInt: integer (nullable = true)
This can be seen by inspecting the code in org.apache.spark.sql.catalyst.ScalaReflection.schemaFor:
def schemaFor(tpe: `Type`): Schema = ScalaReflectionLock.synchronized {
  tpe match {
    // ...
    case t if t <:< localTypeOf[String] => Schema(StringType, nullable = true)
    // ...
    case t if t <:< definitions.IntTpe => Schema(IntegerType, nullable = false)
    case t if t <:< definitions.LongTpe => Schema(LongType, nullable = false)
    case t if t <:< definitions.DoubleTpe => Schema(DoubleType, nullable = false)
    case t if t <:< definitions.FloatTpe => Schema(FloatType, nullable = false)
    case t if t <:< definitions.ShortTpe => Schema(ShortType, nullable = false)
    case t if t <:< definitions.ByteTpe => Schema(ByteType, nullable = false)
    case t if t <:< definitions.BooleanTpe => Schema(BooleanType, nullable = false)
    // ...
  }
}
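The rule in that pattern match can also be checked without building a DataFrame. The sketch below asks Spark for the encoder of a case class and prints its schema. It assumes a spark-shell session on the Spark 2.x line, where the schema of a product encoder appears to be derived through the same reflection code, so treat it as an illustration rather than a guarantee for every version.

```scala
import org.apache.spark.sql.Encoders

case class Test(notNullable: String,
                nullable: Option[String],
                notNullInt: Int,
                nullableInt: Option[Int])

// Encoders.product derives an encoder (and its schema) from the case class by reflection
val schema = Encoders.product[Test].schema
schema.printTreeString()

// Each StructField carries its own nullable flag, so it can be checked programmatically
schema.fields.foreach(f => println(s"${f.name}: nullable = ${f.nullable}"))
```

If the assumption holds, only notNullInt should come back as non-nullable, matching the printSchema output of the local-collection example.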
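Second, if you first create an RDD and then convert it to a DataFrame, rather than converting a local collection directly into a DataFrame, the schema is apparently inferred through a different code path, and the two behave differently: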
case class Test(notNullInt: Int, nullableInt: Option[Int])
val myArray = Array(
  Test(1, None),
  Test(2, Some(3))
)
sc.parallelize(myArray).toDF.printSchema
// root
// |-- notNullInt: integer (nullable = true) // NULLABLE TOO!
// |-- nullableInt: integer (nullable = true)
myArray.toSeq.toDF().printSchema
// root
// |-- notNullInt: integer (nullable = false)
// |-- nullableInt: integer (nullable = true)
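If a particular nullability is needed regardless of how it was inferred (for a String column, or for a DataFrame built from an RDD), one possible workaround is to re-apply an explicit schema. The following is a minimal sketch, assuming a spark-shell session where sc and the implicits are already in scope. The helper setNullable is hypothetical and not part of Spark; it only uses StructField.copy and SparkSession.createDataFrame(RDD[Row], StructType).

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Hypothetical helper (not part of Spark): same rows, but with the nullable
// flag overridden for the columns named in `flags`.
def setNullable(df: DataFrame, flags: Map[String, Boolean]): DataFrame = {
  val fields = df.schema.fields.map { f =>
    flags.get(f.name).map(n => f.copy(nullable = n)).getOrElse(f)
  }
  // Rebuild the DataFrame from its rows using the adjusted schema
  df.sparkSession.createDataFrame(df.rdd, StructType(fields))
}

case class Test(notNullable: String, nullable: Option[String])

// In spark-shell, `sc` and spark.implicits._ are already available
val df = sc.parallelize(Seq(Test("x", None), Test("y", Some("z")))).toDF
val fixed = setNullable(df, Map("notNullable" -> false))
fixed.printSchema
```

Note that Spark does not check the data against the flag here: declaring a column non-nullable while it actually contains nulls can lead to errors or wrong results later, so this is only safe when the data is known to be null-free.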