Dynamically converting dates to Timestamp in Spark Scala / Python [without specifying the date format]

Time: 2018-01-04 17:52:12

Tags: datetime apache-spark pyspark apache-spark-sql spark-dataframe

I need to convert raw dates to timestamps.

Data

id,date,date1,date2,date3
1,161129,19960316,992503,20140205
2,961209,19950325,992206,20140503
3,110620,19960522,991610,20131302
4,160928,19930506,992205,20160112
5,021002,20000326,991503,20131112
6,160721,19960909,991212,20151511
7,160721,20150101,990809,20140809
8,100903,20151212,990605,20011803
9,070713,20170526,990702,19911010 

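Here I have the columns "date", "date1", "date2" and "date3", in which the dates are stored as strings. Normally I convert raw dates with unix_timestamp("<col>","<format>").cast("timestamp"), but now I do not want to specify the format. I want a dynamic approach, because more columns may be added to my table later, and a static approach will not scale in that case.

Some columns hold 6-character dates, where the first 2 characters are the year and the next 4 are the day and month in either order, i.e. yyddMM or yyMMdd.

Other columns hold 8-character dates, where the first 4 characters are the year and the next 4 are the day and month in either order, i.e. yyyyddMM or yyyyMMdd.

Within a single column every value uses the same format. I need to detect that format dynamically and convert the column to a timestamp without hard-coding anything.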
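To make the ambiguity concrete, here is a minimal sketch in plain Python (no Spark needed). It uses `datetime.strptime` as a stand-in for Spark's parser; the mapping from the Spark-style patterns to `strptime` codes and the helper name `valid_patterns` are assumptions made for this illustration only.

```python
from datetime import datetime

# Spark-style pattern -> Python strptime code (mapping assumed for this sketch)
PATTERNS = {"yyMMdd": "%y%m%d", "yyddMM": "%y%d%m"}

def valid_patterns(value):
    """Return the patterns under which `value` is a valid calendar date."""
    ok = []
    for name, fmt in PATTERNS.items():
        try:
            datetime.strptime(value, fmt)
            ok.append(name)
        except ValueError:
            pass  # e.g. a "month" of 29 or 25 is rejected
    return ok

print(valid_patterns("161129"))  # only month-then-day is possible
print(valid_patterns("992503"))  # only day-then-month is possible
print(valid_patterns("021002"))  # both are possible: undecidable on its own
```

A value such as 021002 cannot be classified by itself, which is why the format has to be decided per column rather than per row.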

The output should be timestamps:

+---+-------------------+-------------------+-------------------+-------------------+
| id|               date|              date1|              date2|              date3|
+---+-------------------+-------------------+-------------------+-------------------+
|  1|2016-11-29 00:00:00|1996-03-16 00:00:00|1999-03-25 00:00:00|2014-05-02 00:00:00|
|  2|1996-12-09 00:00:00|1995-03-25 00:00:00|1999-06-22 00:00:00|2014-03-05 00:00:00|
|  3|2011-06-20 00:00:00|1996-05-22 00:00:00|1999-10-16 00:00:00|2013-02-13 00:00:00|
|  4|2016-09-28 00:00:00|1993-05-06 00:00:00|1999-05-22 00:00:00|2016-12-01 00:00:00|
|  5|2002-10-02 00:00:00|2000-03-26 00:00:00|1999-03-15 00:00:00|2013-12-11 00:00:00|
|  6|2016-07-21 00:00:00|1996-09-09 00:00:00|1999-12-12 00:00:00|2015-11-15 00:00:00|
|  7|2016-07-21 00:00:00|2015-01-01 00:00:00|1999-09-08 00:00:00|2014-09-08 00:00:00|
|  8|2010-09-03 00:00:00|2015-12-12 00:00:00|1999-05-06 00:00:00|2001-03-18 00:00:00|
|  9|2007-07-13 00:00:00|2017-05-26 00:00:00|1999-02-07 00:00:00|1991-10-10 00:00:00|
+---+-------------------+-------------------+-------------------+-------------------+

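1 Answer:

Answer 0: (score: 1)

Here is my solution to the requirement above. A UDF applies a few conditions to each value to work out the format of each date column.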

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Infer the pattern of one raw value: a two-digit pair above 12 cannot be a month, so it is the day.
def udf_1(x):
    if len(x)==6 and int(x[-2:]) > 12: return "yyMMdd"
    elif len(x)==8 and int(x[-2:]) > 12: return "yyyyMMdd"
    elif len(x)==6 and int(x[2:4]) > 12: return "yyddMM"    # last pair is <= 12 here
    elif len(x)==8 and int(x[4:6]) > 12: return "yyyyddMM"  # last pair is <= 12 here
    elif len(x)==6: return "N"    # ambiguous: both pairs are <= 12
    elif len(x)==8: return "NA"   # ambiguous: both pairs are <= 12
    else: return "null"           # unexpected length
udf_2 = udf(udf_1, StringType())
# c is the input DataFrame holding the string columns shown in the question
c1 = c.withColumn("date_formate",udf_2("date"))
c2 = c1.withColumn("date1_formate",udf_2("date1"))
c3 = c2.withColumn("date2_formate",udf_2("date2"))
c4 = c3.withColumn("date3_formate",udf_2("date3"))
c4.show()

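With these conditions the format is extracted for every row where it can be determined. When both the day part and the month part are <= 12, the row is ambiguous, so I return "N" for 6-character values and "NA" for 8-character values.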

+------+--------+------+---------+---+------------+-------------+-------------+-------------+
|  date|   date1| date2|    date3| id|date_formate|date1_formate|date2_formate|date3_formate|
+------+--------+------+---------+---+------------+-------------+-------------+-------------+
|161129|19960316|992503| 20140205|  1|      yyMMdd|     yyyyMMdd|       yyddMM|           NA|
|961209|19950325|992206| 20140503|  2|           N|     yyyyMMdd|       yyddMM|           NA|
|110620|19960522|991610| 20131302|  3|      yyMMdd|     yyyyMMdd|       yyddMM|     yyyyddMM|
|160928|19930506|992205| 20160112|  4|      yyMMdd|           NA|       yyddMM|           NA|
|021002|20000326|991503| 20131112|  5|           N|     yyyyMMdd|       yyddMM|           NA|
|160721|19960909|991212| 20151511|  6|      yyMMdd|           NA|            N|     yyyyddMM|
|160721|20150101|990809| 20140809|  7|      yyMMdd|           NA|            N|           NA|
|100903|20151212|990605| 20011803|  8|           N|           NA|            N|     yyyyddMM|
|070713|20170526|990702|19911010 |  9|      yyMMdd|     yyyyMMdd|            N|     yyyyddMM|
+------+--------+------+---------+---+------------+-------------+-------------+-------------+
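The row-level rule and the step from row-level guesses to a single column format can be checked without a Spark session. The following is a minimal plain-Python sketch, not the answer's code: `guess_format` and `resolve_column_format` are hypothetical helper names, and returning `None` for undecidable rows (instead of "N"/"NA") is a simplification made here.

```python
def guess_format(value):
    """Row-level rule: a two-digit pair above 12 cannot be a month."""
    if len(value) not in (6, 8) or not value.isdigit():
        return None
    year = "yy" if len(value) == 6 else "yyyy"
    middle, last = int(value[-4:-2]), int(value[-2:])
    if middle > 12 and last > 12:
        return None  # neither pair can be a month: not a valid date
    if last > 12:
        return year + "MMdd"
    if middle > 12:
        return year + "ddMM"
    return None  # both pairs <= 12: undecidable from this row alone

def resolve_column_format(values):
    """Column-level rule: all decidable rows must agree on one pattern."""
    found = {guess_format(v.strip()) for v in values} - {None}
    if len(found) != 1:
        raise ValueError("cannot resolve a single format: %r" % sorted(found))
    return found.pop()

# The "date2" column from the question
date2 = ["992503", "992206", "991610", "992205", "991503",
         "991212", "990809", "990605", "990702"]
print(resolve_column_format(date2))
```

Requiring all decidable rows to agree is a slightly stricter choice than taking the first decidable row: it surfaces columns with mixed or corrupt values instead of silently picking one format.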

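Next, for each column I take the extracted format, store it in a variable, and pass that variable to unix_timestamp to convert the raw date to a timestamp.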

from pyspark.sql.functions import unix_timestamp

# Placeholders meaning the UDF could not decide. Note that ('NA' or 'N') evaluates
# to just 'NA' in Python, so every placeholder is excluded explicitly with isin.
undecided = ['NA', 'N', 'null']

# Take the format from the first row where it was determined, then parse the whole column with it.
r1 = c4.where(~c4.date_formate.isin(undecided))[['date_formate']].first().date_formate
t_s = unix_timestamp("date",r1).cast("timestamp")
c5=c4.withColumn("date",t_s)

r2 = c5.where(~c5.date1_formate.isin(undecided))[['date1_formate']].first().date1_formate
t_s1 = unix_timestamp("date1",r2).cast("timestamp")
c6 = c5.withColumn("date1",t_s1)

r3 = c6.where(~c6.date2_formate.isin(undecided))[['date2_formate']].first().date2_formate
t_s2 = unix_timestamp("date2",r3).cast("timestamp")
c7 = c6.withColumn("date2",t_s2)

r4 = c7.where(~c7.date3_formate.isin(undecided))[['date3_formate']].first().date3_formate
t_s3 = unix_timestamp("date3",r4).cast("timestamp")
c8 = c7.withColumn("date3",t_s3)

c8.select("id","date","date1","date2","date3").show()

Output

+---+-------------------+-------------------+-------------------+-------------------+
| id|               date|              date1|              date2|              date3|
+---+-------------------+-------------------+-------------------+-------------------+
|  1|2016-11-29 00:00:00|1996-03-16 00:00:00|1999-03-25 00:00:00|2014-05-02 00:00:00|
|  2|1996-12-09 00:00:00|1995-03-25 00:00:00|1999-06-22 00:00:00|2014-03-05 00:00:00|
|  3|2011-06-20 00:00:00|1996-05-22 00:00:00|1999-10-16 00:00:00|2013-02-13 00:00:00|
|  4|2016-09-28 00:00:00|1993-05-06 00:00:00|1999-05-22 00:00:00|2016-12-01 00:00:00|
|  5|2002-10-02 00:00:00|2000-03-26 00:00:00|1999-03-15 00:00:00|2013-12-11 00:00:00|
|  6|2016-07-21 00:00:00|1996-09-09 00:00:00|1999-12-12 00:00:00|2015-11-15 00:00:00|
|  7|2016-07-21 00:00:00|2015-01-01 00:00:00|1999-09-08 00:00:00|2014-09-08 00:00:00|
|  8|2010-09-03 00:00:00|2015-12-12 00:00:00|1999-05-06 00:00:00|2001-03-18 00:00:00|
|  9|2007-07-13 00:00:00|2017-05-26 00:00:00|1999-02-07 00:00:00|1991-10-10 00:00:00|
+---+-------------------+-------------------+-------------------+-------------------+
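As an alternative to classifying rows one by one, the column format can also be found by elimination: try every candidate pattern and keep only those under which all values of the column are valid dates. The sketch below is plain Python and is not part of the answer above; `CANDIDATES`, `parses` and `infer_format` are hypothetical names, and `datetime.strptime` stands in for Spark's parser. With Spark, the values would first have to be collected (for example a sample of each column).

```python
from datetime import datetime

# Spark-style pattern -> Python strptime code, grouped by value length
# (mapping assumed for this sketch)
CANDIDATES = {
    6: {"yyMMdd": "%y%m%d", "yyddMM": "%y%d%m"},
    8: {"yyyyMMdd": "%Y%m%d", "yyyyddMM": "%Y%d%m"},
}

def parses(value, fmt):
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

def infer_format(values):
    """Return (pattern name, strptime code) valid for every value in the column."""
    values = [v.strip() for v in values]
    lengths = {len(v) for v in values}
    if len(lengths) != 1 or not lengths <= set(CANDIDATES):
        raise ValueError("expected all values to have length 6 or all length 8")
    survivors = {name: fmt for name, fmt in CANDIDATES[lengths.pop()].items()
                 if all(parses(v, fmt) for v in values)}
    if len(survivors) != 1:
        raise ValueError("format not uniquely determined: %r" % sorted(survivors))
    return survivors.popitem()

# Two of the columns from the question; any number of columns can be added here
table = {
    "date": ["161129", "961209", "110620", "160928", "021002"],
    "date3": ["20140205", "20140503", "20131302", "20160112", "19911010 "],
}
for column, values in table.items():
    name, fmt = infer_format(values)
    first = datetime.strptime(values[0].strip(), fmt)
    print(column, name, first)
```

Because the loop only depends on the column values, new date columns need no code changes, which is the "dynamic" behaviour the question asks for. A column whose values are all ambiguous still raises an error, since no rule can decide it from the data alone.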