Null values in spark.createDataset

Date: 2017-08-22 09:09:43

Tags: scala apache-spark

I am writing unit tests, and the test data needs to contain some null values. I tried putting null directly in the tuples, and I also tried using Option. Neither worked.

Here is my code:

import sparkSession.implicits._
// Data set with null for even values
val sampleData = sparkSession.createDataset(Seq(
  (1, Some("Yes"), None),
  (2, None, None),
  (3, Some("Okay"), None),
  (4, None, None)))
  .toDF("id", "title", "value")

Stack trace:

scala.MatchError: None.type (of class scala.reflect.internal.Types$UniqueSingleType)
    at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:472)
    at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:596)
    at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:587)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
    at scala.collection.immutable.List.flatMap(List.scala:344)
    at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:587)
    at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
    at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
    at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:49)

2 Answers:

Answer 0 (score: 2)

You should use None: Option[String] instead of a bare None. A bare None is inferred as the singleton type None.type, for which Spark's ScalaReflection cannot derive a serializer, hence the MatchError:

scala> val maybeString = None: Option[String]
maybeString: Option[String] = None

scala> val sampleData = spark.createDataset(Seq(
     |   (1, Some("Yes"), maybeString),
     |   (2, maybeString, maybeString),
     |   (3, Some("Okay"), maybeString),
     |   (4, maybeString, maybeString))).toDF("id", "title", "value")
sampleData: org.apache.spark.sql.DataFrame = [id: int, title: string ... 1 more field]

scala> sampleData.show
+---+-----+-----+
| id|title|value|
+---+-----+-----+
|  1|  Yes| null|
|  2| null| null|
|  3| Okay| null|
|  4| null| null|
+---+-----+-----+
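
Two equivalent workarounds, sketched here as an aside rather than part of the original answer (assuming the same spark session with implicits imported): Option.empty[String] is already typed as Option[String], and ascribing the Seq's element type widens every None before inference runs.

// Workaround 1: Option.empty[String] never infers as None.type
val viaEmpty = spark.createDataset(Seq(
  (1, Option("Yes"), Option.empty[String]),
  (2, Option.empty[String], Option.empty[String])))
  .toDF("id", "title", "value")

// Workaround 2: ascribe the element type so each None widens to Option[String]
val viaAscription = spark.createDataset(
  Seq[(Int, Option[String], Option[String])](
    (1, Some("Yes"), None),
    (2, None, None)))
  .toDF("id", "title", "value")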

Answer 1 (score: 1)

Alternatively, if you are only dealing with strings, you can use null.asInstanceOf[String]:

scala> val df1 = sc.parallelize(Seq((1, "Yes", null.asInstanceOf[String]),
     | (2, null.asInstanceOf[String], null.asInstanceOf[String]),
     | (3, "Okay", null.asInstanceOf[String]),
     | (4, null.asInstanceOf[String], null.asInstanceOf[String]))).toDF("id", "title", "value")
df1: org.apache.spark.sql.DataFrame = [id: int, title: string, value: string]

scala> df1.show
+---+-----+-----+
| id|title|value|
+---+-----+-----+
|  1|  Yes| null|
|  2| null| null|
|  3| Okay| null|
|  4| null| null|
+---+-----+-----+
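
For test fixtures, a case class with Option fields sidesteps the inference problem entirely, because the field types are declared up front. A minimal sketch with a hypothetical Record class (not from either answer):

// Hypothetical case class; the declared Option[String] fields mean a bare
// None at the call site is typed as Option[String], not None.type.
case class Record(id: Int, title: Option[String], value: Option[String])

val ds = spark.createDataset(Seq(
  Record(1, Some("Yes"), None),
  Record(2, None, None),
  Record(3, Some("Okay"), None),
  Record(4, None, None)))

ds.show()  // Option fields surface as nullable columns named after the fields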