I'm writing unit tests, and the test data needs to contain some null values. I tried putting nulls directly in the tuples, and I also tried using Options. Neither worked.
Here is my code:
import sparkSession.implicits._
// Data set with null for even values
val sampleData = sparkSession.createDataset(Seq(
(1, Some("Yes"), None),
(2, None, None),
(3, Some("Okay"), None),
(4, None, None)))
.toDF("id", "title", "value")
Stack trace:
scala.MatchError: None.type (of class scala.reflect.internal.Types$UniqueSingleType)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:472)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:596)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:587)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:587)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:49)
Answer 0 (score: 2)
You should use None: Option[String] instead of a bare None:
scala> val maybeString = None: Option[String]
maybeString: Option[String] = None
scala> val sampleData = spark.createDataset(Seq(
| (1, Some("Yes"), maybeString),
| (2, maybeString, maybeString),
| (3, Some("Okay"), maybeString),
| (4, maybeString, maybeString))).toDF("id", "title", "value")
sampleData: org.apache.spark.sql.DataFrame = [id: int, title: string ... 1 more field]
scala> sampleData.show
+---+-----+-----+
| id|title|value|
+---+-----+-----+
| 1| Yes| null|
| 2| null| null|
| 3| Okay| null|
| 4| null| null|
+---+-----+-----+
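Why the ascription helps (a minimal sketch, no Spark needed): without it, Scala infers the tuple's last element as the singleton type None.type, which is exactly what the MatchError in ScalaReflection reports; typing the value as Option[String] widens it, so the derived encoder sees an ordinary nullable string column.

```scala
// Without an explicit type, a bare None is inferred as the singleton
// type None.type -- the type the MatchError in ScalaReflection reports.
val bare = None

// Ascribing the value as Option[String] widens it, so the derived
// encoder sees an ordinary nullable string column.
val maybeString: Option[String] = None

// The same widening applied to a whole row tuple:
val row: (Int, Option[String], Option[String]) = (1, Some("Yes"), None)
```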
Answer 1 (score: 1)
Alternatively, if you are only dealing with strings, you can use null.asInstanceOf[String]:
val df1 = sc.parallelize(Seq((1, "Yes", null.asInstanceOf[String]),
| (2, null.asInstanceOf[String], null.asInstanceOf[String]),
| (3, "Okay", null.asInstanceOf[String]),
| (4, null.asInstanceOf[String], null.asInstanceOf[String]))).toDF("id", "title", "value")
df1: org.apache.spark.sql.DataFrame = [id: int, title: string, value: string]
scala> df1.show
+---+-----+-----+
| id|title|value|
+---+-----+-----+
| 1| Yes| null|
| 2| null| null|
| 3| Okay| null|
| 4| null| null|
+---+-----+-----+
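A third option, not from the original answers (a sketch, assuming Spark 2.x with a SparkSession named sparkSession in scope), is to model the test rows with a case class whose fields are declared as Option[String]; the encoder then derives the field types from the class itself, so a bare None compiles without any ascription. TestRow is a hypothetical name for illustration:

```scala
// Hypothetical case class for the three test columns; the concrete
// Option[String] field types mean the encoder never sees None.type.
case class TestRow(id: Int, title: Option[String], value: Option[String])

val rows = Seq(
  TestRow(1, Some("Yes"), None),
  TestRow(2, None, None),
  TestRow(3, Some("Okay"), None),
  TestRow(4, None, None)
)

// With a SparkSession in scope this becomes a DataFrame directly:
// import sparkSession.implicits._
// val sampleData = sparkSession.createDataset(rows).toDF("id", "title", "value")
```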