Null values in a field generate a MatchError

Date: 2017-06-22 12:21:28

Tags: scala apache-spark

Here's an interesting one:

val rddSTG = sc.parallelize(
  List( ("RTD","ANT","SOYA BEANS", "20161123", "20161123", 4000, "docid11", null, 5),
        ("RTD","ANT","SOYA BEANS", "20161124", "20161123", 6000, "docid11", null, 4),
        ("RTD","ANT","BANANAS", "20161124", "20161123", 7000, "docid11", null, 9),
        ("HAM","ANT","CORN", "20161123", "20161123", 1000, "docid22", null, 33),
        ("LIS","PAR","BARLEY", "20161123", "20161123", 11111, "docid33", null, 44)
      )
)

val dataframe = rddSTG.toDF("ORIG", "DEST", "PROD", "PLDEPDATE", "PLARRDATE", "PLCOST", "docid", "ACTARRDATE", "mutationseq")
dataframe.createOrReplaceTempView("STG")
spark.sql("SELECT * FROM STG ORDER BY PLDEPDATE DESC").show()

It produces the following error:

scala.MatchError: Null (of class scala.reflect.internal.Types$TypeRef$$anon$6)

As soon as I change any one of the null values to something non-null, it works fine. I think I understand why: no type can be inferred for that field. Still, it does seem strange. Ideas?

3 Answers:

Answer 0 (score: 4):

The problem is that Any is too generic in Scala. In your case, the NULL is being treated as type Any.

Spark simply does not know how to serialize NULL.

We should explicitly provide a specific type.

Since null cannot be assigned to primitive types in Scala, you can use String to match the data type of the column's other values.

So try this:

// toDF on an RDD needs the SparkSession's implicits in scope.
import spark.implicits._

val sampleRdd = spark.sparkContext.parallelize(
  Seq(
    (1, null.asInstanceOf[String], 100, "YES"),
    (2, "RAKTOTPAL", 200, "NO"),
    (3, "BORDOLOI", 300, "YES"),
    (4, null.asInstanceOf[String], 400, "YES")))

sampleRdd.toDF("ID", "NAME", "SCORE","FLAG")

This way, the DataFrame will retain the null values.
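As a quick check (a minimal sketch, assuming a running SparkSession named spark and the snippet above), the inferred schema makes NAME a nullable string, and the nulls come through in the data:

val df = sampleRdd.toDF("ID", "NAME", "SCORE", "FLAG")
df.printSchema()  // NAME: string (nullable = true)
df.show()         // rows 1 and 4 show null in the NAME column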

The other way, with a case class:

case class Record(id: Int, name: String, score: Int, flag: String)

val sampleRdd = spark.sparkContext.parallelize(
  Seq(
    Record(1, null.asInstanceOf[String], 100, "YES"),
    Record(2, "RAKTOTPAL", 200, "NO"),
    Record(3, "BORDOLOI", 300, "YES"),
    Record(4, null.asInstanceOf[String], 400, "YES")))
sampleRdd.toDF()
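A third option, not shown in this answer but standard Spark API, is to skip inference entirely and pass an explicit StructType to createDataFrame. The column names and types below are illustrative:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// The all-null column gets a declared, nullable StringType,
// so Spark never has to infer a type from null values alone.
val schema = StructType(Seq(
  StructField("ID",   IntegerType, nullable = false),
  StructField("NAME", StringType,  nullable = true)))

val rowRdd = spark.sparkContext.parallelize(Seq(Row(1, null), Row(2, "RAKTOTPAL")))
val explicitDf = spark.createDataFrame(rowRdd, schema)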

Answer 1 (score: 2):

I'm not entirely sure of the reason behind the error, but I'm guessing it happens because Null cannot be the data type of a DataFrame column. Since your second-to-last column contains only nulls, its values belong to the type Null; types at the bottom of the hierarchy cannot be instantiated as anything else. However, null is a subtype of everything, so as soon as you change any one of those nulls to a String, the column becomes of type String. This is just a hypothesis.
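That hypothesis is easy to test in a plain Scala REPL (no Spark needed); the inferred types below are what Scala 2 reports:

val allNulls = List(null, null)  // inferred as List[Null]: every element is the null literal
val mixed    = List("x", null)   // inferred as List[String]: one String widens the element type

The same widening happens inside the tuples above, which is why flipping a single null to a String changes the column's type.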

However, for your case, defining a case class will work:

val rdd = sc.parallelize(List ( ("RTD","ANT","SOYA BEANS", "20161123", "20161123", 4000, "docid11", null, 5) , 
            ("RTD","ANT","SOYA BEANS", "20161124", "20161123", 6000, "docid11",  null, 4) ,
            ("RTD","ANT","BANANAS", "20161124", "20161123", 7000, "docid11", null, 9) ,    
            ("HAM","ANT","CORN", "20161123", "20161123", 1000, "docid22", null, 33),
            ("LIS","PAR","BARLEY", "20161123", "20161123", 11111, "docid33", null, 44)))
// ACTARRDATE is declared as String, so the all-null field gets a concrete type.
case class df_schema (ORIG: String, DEST: String, PROD: String, PLDEPDATE: String, PLARRDATE: String, PLCOSTDATE: Int, DOCID: String, ACTARRDATE: String, MUTATIONSEQ: Int)
val rddSTG = rdd.map( x => df_schema(x._1, x._2, x._3, x._4, x._5, x._6, x._7, x._8, x._9) )
val dataframe = sqlContext.createDataFrame(rddSTG)
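As a sanity check (a sketch, assuming the same sqlContext as above on Spark 2.x), the column now comes through as a nullable string and the original query runs without a MatchError:

dataframe.printSchema()  // ACTARRDATE: string (nullable = true)
dataframe.createOrReplaceTempView("STG")
sqlContext.sql("SELECT * FROM STG ORDER BY PLDEPDATE DESC").show()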

Answer 2 (score: 0):

A simple solution for your case is to add a test line, as in

val rddSTG = sc.parallelize(
  Seq( ("RTD","ANT","SOYA BEANS", "20161123", "20161123", 4000, "docid11", null, 5),
       ("RTD","ANT","SOYA BEANS", "20161124", "20161123", 6000, "docid11", null, 4),
       ("RTD","ANT","BANANAS", "20161124", "20161123", 7000, "docid11", null, 9),
       ("HAM","ANT","CORN", "20161123", "20161123", 1000, "docid22", null, 33),
       ("LIS","PAR","BARLEY", "20161123", "20161123", 11111, "docid33", null, 44),
       ("test","test","test", "test", "test", 0, "test", "", 0)
     )
)

and then filter the test line out after the dataframe has been created.

You can apply the rest of your logic as:

import org.apache.spark.sql.functions.col

val dataframe = rddSTG
  .toDF("ORIG", "DEST", "PROD", "PLDEPDATE", "PLARRDATE", "PLCOST", "docid", "ACTARRDATE", "mutationseq")
  .filter(!(col("ORIG") === "test"))

You should get your output with:

dataframe.createOrReplaceTempView("STG")
spark.sql("SELECT * FROM STG ORDER BY PLDEPDATE DESC").show()

I hope this is helpful.