Here's an interesting one:
val rddSTG = sc.parallelize(
  List(
    ("RTD", "ANT", "SOYA BEANS", "20161123", "20161123", 4000, "docid11", null, 5),
    ("RTD", "ANT", "SOYA BEANS", "20161124", "20161123", 6000, "docid11", null, 4),
    ("RTD", "ANT", "BANANAS", "20161124", "20161123", 7000, "docid11", null, 9),
    ("HAM", "ANT", "CORN", "20161123", "20161123", 1000, "docid22", null, 33),
    ("LIS", "PAR", "BARLEY", "20161123", "20161123", 11111, "docid33", null, 44)
  )
)
val dataframe = rddSTG.toDF("ORIG", "DEST", "PROD", "PLDEPDATE", "PLARRDATE", "PLCOST", "docid", "ACTARRDATE", "mutationseq")
dataframe.createOrReplaceTempView("STG")
spark.sql("SELECT * FROM STG ORDER BY PLDEPDATE DESC").show()
It produces the following error:
scala.MatchError: Null (of class scala.reflect.internal.Types$TypeRef$$anon$6)
As soon as I change any one of the null values to something non-null, it works fine. I think I understand why (there is nothing in that column from which to infer a type), but it still seems strange. Ideas?
Answer 0 (score: 4)
The problem is that Any is too generic in Scala. In your case, the null is treated as type Any, and Spark simply has no idea how to serialize it.
We should explicitly provide some specific type. Since null cannot be assigned to a primitive type in Scala, you can use String to match the data type of the column's other values. So try this:
case class Record(id: Int, name: String, score: Int, flag: String)
val sampleRdd = spark.sparkContext.parallelize(
  Seq(
    (1, null.asInstanceOf[String], 100, "YES"),
    (2, "RAKTOTPAL", 200, "NO"),
    (3, "BORDOLOI", 300, "YES"),
    (4, null.asInstanceOf[String], 400, "YES")))
sampleRdd.toDF("ID", "NAME", "SCORE", "FLAG")
This way, the df will retain the null values.
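A quick sanity check (a sketch, not part of the original answer): printing the inferred schema should show the NAME and FLAG columns as nullable strings, which is why the nulls survive.

sampleRdd.toDF("ID", "NAME", "SCORE", "FLAG").printSchema()
// root
//  |-- ID: integer (nullable = false)
//  |-- NAME: string (nullable = true)
//  |-- SCORE: integer (nullable = false)
//  |-- FLAG: string (nullable = true)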
Alternatively, with a case class:
case class Record(id: Int, name: String, score: Int, flag: String)

val sampleRdd = spark.sparkContext.parallelize(
  Seq(
    Record(1, null.asInstanceOf[String], 100, "YES"),
    Record(2, "RAKTOTPAL", 200, "NO"),
    Record(3, "BORDOLOI", 300, "YES"),
    Record(4, null.asInstanceOf[String], 400, "YES")))
sampleRdd.toDF()
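Another fix worth knowing (a sketch, not from this answer; it reuses the rddSTG from the question and assumes a Spark 2.x SparkSession named spark): declare the schema explicitly with a StructType, so Spark never has to infer a type for the all-null ACTARRDATE column.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Every StructField defaults to nullable = true, so nulls are fine anywhere.
val schema = StructType(Seq(
  StructField("ORIG", StringType),
  StructField("DEST", StringType),
  StructField("PROD", StringType),
  StructField("PLDEPDATE", StringType),
  StructField("PLARRDATE", StringType),
  StructField("PLCOST", IntegerType),
  StructField("docid", StringType),
  StructField("ACTARRDATE", StringType), // null in every row, but the type is pinned here
  StructField("mutationseq", IntegerType)))

// Convert each tuple to a generic Row; the schema above supplies the types.
val rowRdd = rddSTG.map(t => Row(t._1, t._2, t._3, t._4, t._5, t._6, t._7, t._8, t._9))
val dataframe = spark.createDataFrame(rowRdd, schema)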
Answer 1 (score: 2)
I'm not entirely sure of the reason behind the error, but my guess is that it happens because Null cannot be the data type of a dataframe column. Since your second-to-last column is null in every row, its inferred type is the trait Null, which sits at the bottom of the reference-type hierarchy and cannot be instantiated as anything else. However, Null is a subtype of every reference type, so as soon as you change any one of those nulls to a String, the whole column is inferred as type String. This is just a hypothesis.
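A quick REPL check (a sketch, not part of the original answer) makes the hypothesis concrete:

// With null in every slot, the compiler infers Null for that tuple position:
val allNull = List(("a", null))             // inferred as List[(String, Null)]
// With one real String present, the least upper bound of Null and String is String:
val mixed = List(("a", null), ("b", "x"))   // inferred as List[(String, String)]
// toDF's reflection has no Spark SQL mapping for the Null type, hence the MatchError;
// a String column that merely contains nulls is perfectly fine.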
For your case, though, defining a case class will work:
val rdd = sc.parallelize(List(
  ("RTD", "ANT", "SOYA BEANS", "20161123", "20161123", 4000, "docid11", null, 5),
  ("RTD", "ANT", "SOYA BEANS", "20161124", "20161123", 6000, "docid11", null, 4),
  ("RTD", "ANT", "BANANAS", "20161124", "20161123", 7000, "docid11", null, 9),
  ("HAM", "ANT", "CORN", "20161123", "20161123", 1000, "docid22", null, 33),
  ("LIS", "PAR", "BARLEY", "20161123", "20161123", 11111, "docid33", null, 44)))

case class df_schema(ORIG: String, DEST: String, PROD: String, PLDEPDATE: String, PLARRDATE: String, PLCOSTDATE: Int, DOCID: String, ACTARRDATE: String, MUTATIONSEQ: Int)

val rddSTG = rdd.map(x => df_schema(x._1, x._2, x._3, x._4, x._5, x._6, x._7, x._8, x._9))
val dataframe = sqlContext.createDataFrame(rddSTG)
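As an aside (an assumption about the setup: Spark 2.x with a SparkSession named spark), the same thing can be written without sqlContext by calling toDF() directly on the RDD of case-class instances:

import spark.implicits._

// toDF() reads the column names and types from df_schema via reflection.
val dataframe = rdd
  .map(x => df_schema(x._1, x._2, x._3, x._4, x._5, x._6, x._7, x._8, x._9))
  .toDF()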
Answer 2 (score: 0)
A simple solution in your case is to add a test line:
val rddSTG = sc.parallelize(
  Seq(
    ("RTD", "ANT", "SOYA BEANS", "20161123", "20161123", 4000, "docid11", null, 5),
    ("RTD", "ANT", "SOYA BEANS", "20161124", "20161123", 6000, "docid11", null, 4),
    ("RTD", "ANT", "BANANAS", "20161124", "20161123", 7000, "docid11", null, 9),
    ("HAM", "ANT", "CORN", "20161123", "20161123", 1000, "docid22", null, 33),
    ("LIS", "PAR", "BARLEY", "20161123", "20161123", 11111, "docid33", null, 44),
    ("test", "test", "test", "test", "test", 0, "test", "", 0)
  )
)
Then filter that test line out once the dataframe has been created. The test row puts "" rather than null in the ACTARRDATE slot, so Scala infers String for that column instead of Null:
import org.apache.spark.sql.functions.col

val dataframe = rddSTG
  .toDF("ORIG", "DEST", "PROD", "PLDEPDATE", "PLARRDATE", "PLCOST", "docid", "ACTARRDATE", "mutationseq")
  .filter(!(col("ORIG") === "test"))
The rest of your logic can then be applied as before, and you should get your output:
dataframe.createOrReplaceTempView("STG")
spark.sql("SELECT * FROM STG ORDER BY PLDEPDATE DESC").show()
I hope this is helpful.