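I am new to Scala. I have a DataFrame, and I am trying to convert one of its string columns to a date, in other words as shown below:
1) yyyyMMddHHmmss (20150610120256) -> yyyy-MM-dd HH:mm:ss (2015-06-10 12:02:56)
2) yyyyMMddHHmmss (20150611 ) -> yyyy-MM-dd (2015-06-11)
I was able to handle the first case successfully. The problem with the second case is that the time part is missing, so I cannot convert the value to a date. More details are given below. Any help would be appreciated.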
df.printSchema
root
|-- TYPE: string (nullable = true)
|-- CODE: string (nullable = true)
|-- SQ_CODE: string (nullable = true)
|-- RE_TYPE: string (nullable = true)
|-- VERY_ID: long (nullable = true)
|-- IN_DATE: string (nullable = true)
df.show
Input
+-----+-------+---------+---------+-------------------+-----------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+-----------------+
| F | 000544| 2017002| OP | 95032015062763298| 20150610120256 |
| F | 000544| 2017002| LD | 95032015062763261| 20150611 |
| F | 000544| 2017002| AK | 95037854336743246| 20150611012356 |
+-----+-------+---------+---------+-------------------+-----------------+
df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("date")))
.otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))
Actual output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+----------------------+
| F | 000544| 2017002| OP | 95032015062763298| 2015-06-10 12:02:56 |
| F | 000544| 2017002| LD | 95032015062763261| null |
| F | 000544| 2017002| AK | 95037854336743246| 2015-06-11 01:23:56 |
+-----+-------+---------+---------+-------------------+----------------------+
df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("timestamp")))
.otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))
Actual output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+----------------------+
| F | 000544| 2017002| OP | 95032015062763298| 2015-06-10 12:02:56 |
| F | 000544| 2017002| LD | 95032015062763261| 2015-06-11 00:00:00 |
| F | 000544| 2017002| AK | 95037854336743246| 2015-06-11 01:23:56 |
+-----+-------+---------+---------+-------------------+----------------------+
Expected output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+----------------------+
| F | 000544| 2017002| OP | 95032015062763298| 2015-06-10 12:02:56 |
| F | 000544| 2017002| LD | 95032015062763261| 2015-06-11 |
| F | 000544| 2017002| AK | 95037854336743246| 2015-06-11 01:23:56 |
+-----+-------+---------+---------+-------------------+----------------------+
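Answer 0 (score: 1)
There are several ways to implement a date parser, one of which is TODATE(). Here's an example of that implementation.
Answer 1 (score: 1)
I'd use TimestampType and coalesce over the different formats.
import org.apache.spark.sql.functions._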
val df = Seq("20150610120256", "20150611").toDF("IN_DATE")
df.withColumn("IN_DATE", coalesce(
to_timestamp($"IN_DATE", "yyyyMMddHHmmss"),
to_timestamp($"IN_DATE", "yyyyMMdd"))).show
+-------------------+
| IN_DATE|
+-------------------+
|2015-06-10 12:02:56|
|2015-06-11 00:00:00|
+-------------------+
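If the column really has to display as in the question's expected output (date only for the short values, date and time for the long ones), one option is to keep it as a string and format each case separately. The following is only a minimal sketch under some assumptions: Spark 2.2 or later (for to_date and to_timestamp with a format argument), a local SparkSession, and a hypothetical output column name IN_DATE_STR that does not appear in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a throwaway local session, only for this sketch.
val spark = SparkSession.builder().master("local[1]").appName("in-date-sketch").getOrCreate()
import spark.implicits._

// Sample values from the question, including the trailing space on the short one.
val df = Seq("20150610120256", "20150611 ", "20150611012356").toDF("IN_DATE")

// Strip surrounding whitespace before measuring the length or parsing.
val raw = trim($"IN_DATE")

// IN_DATE_STR is a hypothetical column name; it stays a string so that
// both renderings can live in the same column.
val formatted = df.withColumn("IN_DATE_STR",
  when(length(raw) === 8,
    date_format(to_date(raw, "yyyyMMdd"), "yyyy-MM-dd"))
  .otherwise(
    date_format(to_timestamp(raw, "yyyyMMddHHmmss"), "yyyy-MM-dd HH:mm:ss")))

formatted.show(false)
```

The trade-off is that IN_DATE_STR is no longer a date or timestamp type, so date arithmetic would still need the coalesced timestamp column.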
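Answer 2 (score: 0)
Try this query:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DateType, TimestampType}
// Note: the seconds field is lowercase "ss"; uppercase "SS" means fraction of a second.
df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast(DateType)))
.otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast(TimestampType)))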
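A note on the pattern letters, since they are easy to mix up: in Java-style date patterns, lowercase ss is seconds and uppercase SS is a fraction of a second. The sketch below uses plain java.text.SimpleDateFormat (which, as far as I know, older Spark versions relied on for unix_timestamp; treat that as an assumption) to show what happens to the seconds with each pattern. The helper name secondsAndMillis is made up for this illustration.

```scala
import java.text.SimpleDateFormat
import java.util.{Calendar, TimeZone}

// Hypothetical helper: parse `value` with `pattern` and return
// the (seconds, milliseconds) fields of the result.
def secondsAndMillis(pattern: String, value: String): (Int, Int) = {
  val utc = TimeZone.getTimeZone("UTC")
  val fmt = new SimpleDateFormat(pattern)
  fmt.setTimeZone(utc)
  val cal = Calendar.getInstance(utc)
  cal.setTime(fmt.parse(value))
  (cal.get(Calendar.SECOND), cal.get(Calendar.MILLISECOND))
}

// Uppercase SS: the trailing "56" is read as milliseconds, seconds stay 0.
val withUpperS = secondsAndMillis("yyyyMMddHHmmSS", "20150610120256")

// Lowercase ss: the trailing "56" is read as seconds.
val withLowerS = secondsAndMillis("yyyyMMddHHmmss", "20150610120256")

println(s"SS -> $withUpperS, ss -> $withLowerS")
```

Once such a value is truncated to whole seconds, the 56 is lost, which would explain a result such as 12:02:00 instead of 12:02:56.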
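Answer 3 (score: 0)
2015-06-11 is of type spark.sql.types.DateType, and 2015-06-10 12:02:56 is of type spark.sql.types.TimestampType.
You cannot have two data types in the same column. A schema should have only one data type per column.
I suggest you create two new columns and put the format you need in each of them: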
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DateType, TimestampType}
// Date-only values: parse with yyyyMMdd, then cast to DateType.
df.withColumn("IN_DATE_DateOnly",from_unixtime(unix_timestamp(df("IN_DATE"),"yyyyMMdd")).cast(DateType))
// Date-and-time values: the seconds field is lowercase "ss" (uppercase "SS" is fraction of a second).
.withColumn("IN_DATE_DateAndTime",unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast(TimestampType))
This will give you the following dataframe:
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|TYPE|CODE  |SQ_CODE|RE_TYPE|VERY_ID          |IN_DATE       |IN_DATE_DateOnly|IN_DATE_DateAndTime  |
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|F   |000544|2017002|OP     |95032015062763298|20150610120256|null            |2015-06-10 12:02:56.0|
|F   |000544|2017002|LD     |95032015062763261|20150611      |2015-06-11      |null                 |
|F   |000544|2017002|AK     |95037854336743246|20150611012356|null            |2015-06-11 01:23:56.0|
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
You can see that the data types are different:
root
|-- TYPE: string (nullable = true)
|-- CODE: string (nullable = true)
|-- SQ_CODE: string (nullable = true)
|-- RE_TYPE: string (nullable = true)
|-- VERY_ID: string (nullable = true)
|-- IN_DATE: string (nullable = true)
|-- IN_DATE_DateOnly: date (nullable = true)
|-- IN_DATE_DateAndTime: timestamp (nullable = true)
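The same distinction shows up outside Spark. As a rough illustration only (plain Scala with java.time, not Spark code), a parser for these strings naturally returns one of two different types, and a single display form is only possible once both are rendered as strings. The names parseInDate and render are hypothetical helpers invented for this sketch.

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

val dateFmt = DateTimeFormatter.ofPattern("yyyyMMdd")
val dateTimeFmt = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")
val displayFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Hypothetical helper: Left for date-only input, Right for date-and-time input,
// None when the trimmed value has another length or does not parse.
def parseInDate(raw: String): Option[Either[LocalDate, LocalDateTime]] = {
  val s = raw.trim
  s.length match {
    case 8  => Try(LocalDate.parse(s, dateFmt)).toOption.map(Left(_))
    case 14 => Try(LocalDateTime.parse(s, dateTimeFmt)).toOption.map(Right(_))
    case _  => None
  }
}

// Hypothetical helper: render either variant as a display string.
def render(value: Either[LocalDate, LocalDateTime]): String = value match {
  case Left(date)      => date.toString // ISO form, yyyy-MM-dd
  case Right(dateTime) => dateTime.format(displayFmt)
}

Seq("20150610120256", "20150611 ", "20150611012356")
  .foreach(s => println(parseInDate(s).map(render)))
```

Logic like this could in principle be wrapped in a UDF that returns a string, but the resulting column would then be a string column rather than a date or timestamp.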
I hope this answer is helpful.