Change the data type of a column in a Spark RDD to date and query it

Time: 2016-07-28 09:49:04

Tags: apache-spark apache-spark-sql

By default, when I load the data, every column is treated as a string. The data looks like this:

firstName,lastName,age,doj
dileep,gog,21,2016-01-01
avishek,ganguly,21,2016-01-02
shreyas,t,20,2016-01-03
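(The question doesn't show the loading step; below is a minimal sketch of one common way such a file ends up with all-string columns, assuming spark-shell and a hypothetical local file people.csv containing the rows above.)

import sqlContext.implicits._  // already in scope in spark-shell

// Split each line on commas; every field stays a plain string.
val raw = sc.textFile("people.csv")  // hypothetical path
val header = raw.first()
val loaded = raw.filter(_ != header)
  .map(_.split(","))
  .map(a => (a(0), a(1), a(2), a(3)))
  .toDF("firstName", "lastName", "age", "doj")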

After updating the schema of the RDD, it looks like this:

temp.printSchema
root
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- age: string (nullable = true)
 |-- doj: date (nullable = true)

I register a temp table and query it:

temp.registerTempTable("temptable")
val temp1 = sqlContext.sql("select * from temptable")
temp1.show()
+---------+--------+---+----------+
|firstName|lastName|age|       doj|
+---------+--------+---+----------+
|   dileep|     gog| 21|2016-01-01|
|  avishek| ganguly| 21|2016-01-02|
|  shreyas|       t| 20|2016-01-03|
+---------+--------+---+----------+
val temp2 = sqlContext.sql("select * from temptable where doj > cast('2016-01-02' as date)")

This runs without error and gives me:

temp2: org.apache.spark.sql.DataFrame = [firstName: string, lastName: string, age: string, doj: date]

but when I do:

temp2.show()
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer

1 Answer:

Answer 0 (score: 0):

So I tried your code and it works for me. I suspect the problem is how you changed the schema in the first place, which looks off to me (granted, it's a bit hard to read since you posted it in a comment; you should update the question with that code).
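For illustration, here is a hypothetical reconstruction (not from the question) of the kind of schema update that produces exactly this error: declaring doj as DateType in a StructType while the row values are still plain strings. Nothing fails at definition time because Spark is lazy, and depending on the version, createDataFrame may not validate the values up front; in Spark 1.x, DateType is stored internally as an integer day count, so the string only fails the cast when an action forces evaluation:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Rows whose fields are all plain strings, as loaded from the file
val rowRDD = sc.parallelize(List(
  Row("dileep", "gog", "21", "2016-01-01"),
  Row("avishek", "ganguly", "21", "2016-01-02"),
  Row("shreyas", "t", "20", "2016-01-03")))

// The schema declares doj as DateType, but the values were never converted
val schema = StructType(Seq(
  StructField("firstName", StringType, true),
  StructField("lastName", StringType, true),
  StructField("age", StringType, true),
  StructField("doj", DateType, true)))

val broken = sqlContext.createDataFrame(rowRDD, schema) // no error yet: Spark is lazy
broken.printSchema() // looks right: doj is date
broken.show()        // java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer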

Anyway, here is what I did:

First, to mock your input:

import sqlContext.implicits._  // already imported in spark-shell; needed for toDF in an app

val df = sc.parallelize(List(
  ("dileep", "gog", "21", "2016-01-01"),
  ("avishek", "ganguly", "21", "2016-01-02"),
  ("shreyas", "t", "20", "2016-01-03"))).toDF("firstName", "lastName", "age", "doj")

Then:

import org.apache.spark.sql.functions._

val temp = df.withColumn("doj", to_date('doj))
temp.registerTempTable("temptable")
val temp2 = sqlContext.sql("select * from temptable where doj > cast('2016-01-02' as date)")

Executing temp2.show() then displays, as expected:

+---------+--------+---+----------+
|firstName|lastName|age|       doj|
+---------+--------+---+----------+
|  shreyas|       t| 20|2016-01-03|
+---------+--------+---+----------+
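For completeness, the same filter can also be written with the DataFrame API instead of SQL; a sketch under the same Spark 1.x setup as above:

import org.apache.spark.sql.functions._

// Same predicate without the temp table: cast the literal to date and compare.
val temp3 = temp.where(col("doj") > lit("2016-01-02").cast("date"))
temp3.show()  // should print only the shreyas row

On Spark 2.x the same code works with minor renames: registerTempTable is deprecated in favor of createOrReplaceTempView, and sqlContext is subsumed by the SparkSession.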