Change the data type of a column in a Spark RDD to date and query it

Time: 2016-07-28 09:49:04

Tags: apache-spark apache-spark-sql

By default, when I load the data, every column is treated as a string. The data looks like this:

firstName,lastName,age,doj
dileep,gog,21,2016-01-01
avishek,ganguly,21,2016-01-02
shreyas,t,20,2016-01-03
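(The question doesn't show the loading step; below is a minimal sketch of one common way such a file ends up with all-string columns, assuming spark-shell and a hypothetical local file people.csv containing the rows above.)

import sqlContext.implicits._  // already in scope in spark-shell

// Split each line on commas; every field stays a plain string.
val raw = sc.textFile("people.csv")  // hypothetical path
val header = raw.first()
val loaded = raw.filter(_ != header)
  .map(_.split(","))
  .map(a => (a(0), a(1), a(2), a(3)))
  .toDF("firstName", "lastName", "age", "doj")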

After updating the schema of the RDD, it looks like this:

temp.printSchema
root
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- age: string (nullable = true)
 |-- doj: date (nullable = true)

I register a temp table and query it:

temp.registerTempTable("temptable")
val temp1 = sqlContext.sql("select * from temptable")
temp1.show()
+---------+--------+---+----------+
|firstName|lastName|age|       doj|
+---------+--------+---+----------+
|   dileep|     gog| 21|2016-01-01|
|  avishek| ganguly| 21|2016-01-02|
|  shreyas|       t| 20|2016-01-03|
+---------+--------+---+----------+
val temp2 = sqlContext.sql("select * from temptable where doj > cast('2016-01-02' as date)")

This runs without error and gives me:

temp2: org.apache.spark.sql.DataFrame = [firstName: string, lastName: string, age: string, doj: date]

but when I do:

temp2.show()
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer

1 Answer:

Answer 0 (score: 0):

So I tried your code and it works for me. I suspect the problem is how you changed the schema in the first place, which looks off to me (granted, it's a bit hard to read since you posted it in a comment; you should update the question with that code).
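For illustration, here is a hypothetical reconstruction (not from the question) of the kind of schema update that produces exactly this error: declaring doj as DateType in a StructType while the row values are still plain strings. Nothing fails at definition time because Spark is lazy, and depending on the version, createDataFrame may not validate the values up front; in Spark 1.x, DateType is stored internally as an integer day count, so the string only fails the cast when an action forces evaluation:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Rows whose fields are all plain strings, as loaded from the file
val rowRDD = sc.parallelize(List(
  Row("dileep", "gog", "21", "2016-01-01"),
  Row("avishek", "ganguly", "21", "2016-01-02"),
  Row("shreyas", "t", "20", "2016-01-03")))

// The schema declares doj as DateType, but the values were never converted
val schema = StructType(Seq(
  StructField("firstName", StringType, true),
  StructField("lastName", StringType, true),
  StructField("age", StringType, true),
  StructField("doj", DateType, true)))

val broken = sqlContext.createDataFrame(rowRDD, schema) // no error yet: Spark is lazy
broken.printSchema() // looks right: doj is date
broken.show()        // java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer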

Anyway, here is what I did:

First, to mock your input:

import sqlContext.implicits._  // already imported in spark-shell; needed for toDF in an app

val df = sc.parallelize(List(
  ("dileep", "gog", "21", "2016-01-01"),
  ("avishek", "ganguly", "21", "2016-01-02"),
  ("shreyas", "t", "20", "2016-01-03"))).toDF("firstName", "lastName", "age", "doj")

Then:

import org.apache.spark.sql.functions._

val temp = df.withColumn("doj", to_date('doj))
temp.registerTempTable("temptable")
val temp2 = sqlContext.sql("select * from temptable where doj > cast('2016-01-02' as date)")

Executing temp2.show() then displays, as expected:

+---------+--------+---+----------+
|firstName|lastName|age|       doj|
+---------+--------+---+----------+
|  shreyas|       t| 20|2016-01-03|
+---------+--------+---+----------+
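For completeness, the same filter can also be written with the DataFrame API instead of SQL; a sketch under the same Spark 1.x setup as above:

import org.apache.spark.sql.functions._

// Same predicate without the temp table: cast the literal to date and compare.
val temp3 = temp.where(col("doj") > lit("2016-01-02").cast("date"))
temp3.show()  // should print only the shreyas row

On Spark 2.x the same code works with minor renames: registerTempTable is deprecated in favor of createOrReplaceTempView, and sqlContext is subsumed by the SparkSession.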