您需要将原始日期转换为时间戳
数据
id,date,date1,date2,date3
1,161129,19960316,992503,20140205
2,961209,19950325,992206,20140503
3,110620,19960522,991610,20131302
4,160928,19930506,992205,20160112
5,021002,20000326,991503,20131112
6,160721,19960909,991212,20151511
7,160721,20150101,990809,20140809
8,100903,20151212,990605,20011803
9,070713,20170526,990702,19911010
这里我有专栏&#34; date&#34;,&#34; date1&#34;,&#34; date2&#34;和&#34; date3&#34;其中日期是字符串格式。通常我使用unix_timestamp("<col>","<formate>").cast("timestamp")
转换原始日期,但现在我不想要提及格式,我想要动态方法,因为以后可能会有更多列添加到我的表中。在这种情况下,静态方法不会发挥最佳作用。
在某些列中,我们将有6个字符的日期,其中前2个字符代表&#34;年&#34;接下来4代表&#34; date&#34;和&#34;月&#34;即yyddmm或 YYMMDD。
其他一些列我们将有8个字符的日期,前4个字符代表&#34;年&#34;接下来4代表&#34; date&#34;和&#34;月&#34;即yyyyddmm或yyyymmdd。
我们对每个列都有相同的格式,需要动态查找并将其转换为时间戳而无需硬编码。
输出应该是时间戳。
+---+-------------------+-------------------+-------------------+-------------------+
| id| date| date1| date2| date3|
+---+-------------------+-------------------+-------------------+-------------------+
| 1|2016-11-29 00:00:00|1996-03-16 00:00:00|1999-03-25 00:00:00|2014-05-02 00:00:00|
| 2|1996-12-09 00:00:00|1995-03-25 00:00:00|1999-06-22 00:00:00|2014-03-05 00:00:00|
| 3|2011-06-20 00:00:00|1996-05-22 00:00:00|1999-10-16 00:00:00|2013-02-13 00:00:00|
| 4|2016-09-28 00:00:00|1993-05-06 00:00:00|1999-05-22 00:00:00|2016-12-01 00:00:00|
| 5|2002-10-02 00:00:00|2000-03-26 00:00:00|1999-03-15 00:00:00|2013-12-11 00:00:00|
| 6|2016-07-21 00:00:00|1996-09-09 00:00:00|1999-12-12 00:00:00|2015-11-15 00:00:00|
| 7|2016-07-21 00:00:00|2015-01-01 00:00:00|1999-09-08 00:00:00|2014-09-08 00:00:00|
| 8|2010-09-03 00:00:00|2015-12-12 00:00:00|1999-05-06 00:00:00|2001-03-18 00:00:00|
| 9|2007-07-13 00:00:00|2017-05-26 00:00:00|1999-02-07 00:00:00|1991-10-10 00:00:00|
+---+-------------------+-------------------+-------------------+-------------------+
答案 0 :(得分:1)
这里有我的上述要求。给定UDF中的一些条件以查找每个日期列的格式。
def udf_1(x:String):
if len(x)==6 and int(x[-2:]) > 12: return "yyMMdd"
elif len(x)==8 and int(x[-2:]) > 12: return "yyyyMMdd"
elif len((x))==6 and int(x[2:4]) <12 and int(x[-2:]) >12: return "yyMMdd"
elif len((x))==8 and int(x[4:6]) <12 and int(x[-2:]) >12: return "yyyyMMdd"
elif len((x))==6 and int(x[2:4]) >12 and int(x[-2:]) <12: return "yyddMM"
elif len((x))==8 and int(x[4:6]) >12 and int(x[-2:]) <12: return "yyyyddMM"
elif len((x))==6 and int(x[2:4]) <=12 and int(x[-2:]) <=12: return "N"
elif len((x))==8 and int(x[4:6]) <=12 and int(x[-2:]) <=12: return "NA"
else: return "null"
udf_2 = udf(udf_1, StringType())
c1 = c.withColumn("date_formate",udf_2("date"))
c2 = c1.withColumn("date1_formate",udf_2("date1"))
c3 = c2.withColumn("date2_formate",udf_2("date2"))
c4 = c3.withColumn("date3_formate",udf_2("date3"))
c4.show()
使用指定的条件,我已经提取了某些行的格式,并且在日期和月份的情况下,&lt; = 12我已经给出了&#34; N&#34; 6个字符和&#34; NA&#34; 8个字符。
+------+--------+------+---------+---+------------+-------------+-------------+-------------+
| date| date1| date2| date3| id|date_formate|date1_formate|date2_formate|date3_formate|
+------+--------+------+---------+---+------------+-------------+-------------+-------------+
|161129|19960316|992503| 20140205| 1| yyMMdd| yyyyMMdd| yyddMM| NA|
|961209|19950325|992206| 20140503| 2| N| yyyyMMdd| yyddMM| NA|
|110620|19960522|991610| 20131302| 3| yyMMdd| yyyyMMdd| yyddMM| yyyyddMM|
|160928|19930506|992205| 20160112| 4| yyMMdd| NA| yyddMM| NA|
|021002|20000326|991503| 20131112| 5| N| yyyyMMdd| yyddMM| NA|
|160421|19960909|991212| 20151511| 6| yyMMdd| NA| N| yyyyddMM|
|160721|20150101|990809| 20140809| 7| yyMMdd| NA| N| NA|
|100903|20151212|990605| 20011803| 8| N| NA| N| yyyyddMM|
|070713|20170526|990702|19911010 | 9| yyMMdd| yyyyMMdd| N| yyyyddMM|
+------+--------+------+---------+---+------------+-------------+-------------+-------------+
现在我已经提取了格式并将其存储在变量中并在unix_timestamp中调用该变量以将原始日期转换为时间戳。
r1 = c4.where(c4.date_formate != ('NA' or 'N'))[['date_formate']].first().date_formate
t_s = unix_timestamp("date",r1).cast("timestamp")
c5=c4.withColumn("date",t_s)
r2 = c5.where(c5.date1_formate != ('NA' or 'N'))[['date1_formate']].first().date1_formate
t_s1 = unix_timestamp("date1",r2).cast("timestamp")
c6 = c5.withColumn("date1",t_s1)
r3 = c6.where(c6.date2_formate != ('NA' or 'N'))[['date2_formate']].first().date2_formate
t_s2 = unix_timestamp("date2",r3).cast("timestamp")
c7 = c6.withColumn("date2",t_s2)
r4 = c7.where(c7.date3_formate != ('NA' or 'N'))[['date3_formate']].first().date3_formate
t_s3 = unix_timestamp("date3",r4).cast("timestamp")
c8 = c7.withColumn("date3",t_s3)
c8.select("id","date","date1","date2","date3").show()
输出
+---+-------------------+-------------------+-------------------+-------------------+
| id| date| date1| date2| date3|
+---+-------------------+-------------------+-------------------+-------------------+
| 1|2016-11-29 00:00:00|1996-03-16 00:00:00|1999-03-25 00:00:00|2014-05-02 00:00:00|
| 2|1996-12-09 00:00:00|1995-03-25 00:00:00|1999-06-22 00:00:00|2014-03-05 00:00:00|
| 3|2011-06-20 00:00:00|1996-05-22 00:00:00|1999-10-16 00:00:00|2013-02-13 00:00:00|
| 4|2016-09-28 00:00:00|1993-05-06 00:00:00|1999-05-22 00:00:00|2016-12-01 00:00:00|
| 5|2002-10-02 00:00:00|2000-03-26 00:00:00|1999-03-15 00:00:00|2013-12-11 00:00:00|
| 6|2016-07-21 00:00:00|1996-09-09 00:00:00|1999-12-12 00:00:00|2015-11-15 00:00:00|
| 7|2016-07-21 00:00:00|2015-01-01 00:00:00|1999-09-08 00:00:00|2014-09-08 00:00:00|
| 8|2010-09-03 00:00:00|2015-12-12 00:00:00|1999-05-06 00:00:00|2001-03-18 00:00:00|
| 9|2007-07-13 00:00:00|2017-05-26 00:00:00|1999-02-07 00:00:00|1991-10-10 00:00:00|
+---+-------------------+-------------------+-------------------+-------------------+