Convert 20180918 to 2018-09-18 in Spark?

Date: 2019-03-19 11:41:32

Tags: scala apache-spark apache-spark-sql

Given the dataframe:

+--------+---------+--------+--------+----+
| user_id|       id|    date|discount|year|
+--------+---------+--------+--------+----+
|44143827|118775509|20180103|     0.0|2018|
|16445775|118841685|20180105|     0.0|2018|
|25230573|119388676|20180111|     0.0|2018|
|44634333|119537508|20180112|     0.0|2018|
+--------+---------+--------+--------+----+

I want to convert this date from yyyyMMdd to yyyy-MM-dd. I can do it for a single value, i.e.

scala> val x = "20180918"
x: String = 20180918

scala> x.patch(4,"-",0)
res76: String = 2018-0918

scala> x.patch(4,"-",0).patch(7,"-",0)
res77: String = 2018-09-18

but I can't figure out how to do it for the full dataset. Could someone please help?

3 Answers:

Answer 0 (score: 1):

Use the date_format() and to_timestamp() functions. Check this out:

scala> val df = Seq((20180103),(20180105)).toDF("dt")
df: org.apache.spark.sql.DataFrame = [dt: int]

scala> df.withColumn("dt",'dt.cast("string")).withColumn("dt",date_format(to_timestamp('dt,"yyyyMMdd"),"yyyy-MM-dd")).show(false)
+----------+
|dt        |
+----------+
|2018-01-03|
|2018-01-05|
+----------+

scala>

Note that date_format() returns a string; if you want the date data type, cast it back:

scala> val df2 = df.withColumn("dt",'dt.cast("string")).withColumn("dt",date_format(to_timestamp('dt,"yyyyMMdd"),"yyyy-MM-dd"))
df2: org.apache.spark.sql.DataFrame = [dt: string]

scala> df2.printSchema
root
 |-- dt: string (nullable = true)


scala> val df3 = df2.withColumn("dt",'dt.cast("date"))
df3: org.apache.spark.sql.DataFrame = [dt: date]

scala> df3.printSchema
root
 |-- dt: date (nullable = true)


scala> df3.show(false)
+----------+
|dt        |
+----------+
|2018-01-03|
|2018-01-05|
+----------+


scala>
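As a side note, on Spark 2.2+ the to_timestamp/date_format round trip can be skipped by giving to_date() a format string directly (a minimal sketch, assuming the same df with the integer dt column as above):

scala> df.withColumn("dt", to_date('dt.cast("string"), "yyyyMMdd")).printSchema
root
 |-- dt: date (nullable = true)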

Answer 1 (score: 0):

Assuming you want a string as output, you can create a new UDF that converts the input string from yyyyMMdd format to yyyy-MM-dd, as follows:

import org.apache.spark.sql.functions.udf

def dateFormatDef(x: String): String = x.patch(4,"-",0).patch(7,"-",0)
val dateFormat = udf[String, String](dateFormatDef)

The string is then output in the expected format:

val df2 = df.withColumn("newFormat", dateFormat($"date"))
df2.show()
+--------+----------+
|    date| newFormat|
+--------+----------+
|20180103|2018-01-03|
|20180105|2018-01-05|
|20180111|2018-01-11|
|20180112|2018-01-12|
+--------+----------+
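
For what it's worth, the same reshaping can be done without a UDF using the built-in regexp_replace() function (a sketch, not from the original answer; it assumes date is always an 8-digit string):

import org.apache.spark.sql.functions.regexp_replace

df.withColumn("newFormat",
  regexp_replace($"date", "^(\\d{4})(\\d{2})(\\d{2})$", "$1-$2-$3")).show()

Built-in column functions like this generally avoid the serialization overhead a UDF incurs.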

Answer 2 (score: 0):

In PySpark, you can do it as below:

# create a data frame
df = sqlContext.createDataFrame(
    [
        ("SirChillingtonIV", "20120104"),
        ("Booooooo99900098", "20120104"),
        ("Booooooo99900098", "20120106"),
        ("OprahWinfreyJr", "20120110"),
        ("SirChillingtonIV", "20120111"),
        ("SirChillingtonIV", "20120114"),
        ("SirChillingtonIV", "20120811")
    ],
    ("user_name", "login_date"))


# Import functions
from pyspark.sql import functions as f

# Create a data frame with a new column new_date holding the date in the desired format
df1 = df.withColumn("new_date", f.from_unixtime(f.unix_timestamp("login_date",'yyyyMMdd'),'yyyy-MM-dd'))
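
For completeness, a Scala sketch of the same unix_timestamp/from_unixtime round trip (column names borrowed from the PySpark snippet above):

import org.apache.spark.sql.functions.{from_unixtime, unix_timestamp}

// new_date holds login_date reformatted as yyyy-MM-dd (still a string column)
val df1 = df.withColumn("new_date",
  from_unixtime(unix_timestamp($"login_date", "yyyyMMdd"), "yyyy-MM-dd"))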