I need to extract a timestamp from the value column.
I tried using getItem, but it returns nothing.
// Assumes org.apache.spark.sql.functions._ and spark.implicits._ are imported (for split, col, $"...", etc.)
val data = df.withColumn("splitted", split($"value", "/"))
  .select($"splitted".getItem(6).alias("region"), $"splitted".getItem(7).alias("service"), col("value"))
  .withColumn("service_type", regexp_extract($"service", """.*(Inbound|Outbound|Outound).*""", 1))
  .withColumn("region_type", concat(
    when(col("region").isNotNull, col("region")).otherwise(lit("null")), lit(" "),
    when(col("service").isNotNull, col("service_type")).otherwise(lit("null"))))
  // Split the raw value on tab characters (the closing parenthesis was missing here)
  .withColumn("splitt", split($"value", "\t"))
  .select($"splitt".getItem(1).alias("datetime"))
I need to extract the timestamp 2019-05-14 04:02:03 from the string below into a new column "datetime":
{"value":"2019-05-14T09:02:06.486Z index:: host:: 2019-05-14 04:02:03,307 INFO - \tTue May 14 04:02:03 CDT 2019\tID:<490744.1557824523305.0>\tsv\tAFTER_LOOKUP_QUERY_PARTNER_CHANNEL\t[messageData(DispatchID: 06708235871 Region: EMEA SubRegion: EU OperationType: <OperationType>STATUSUPDATE</OperationType> Operation: StatusUpdate)]\tms \t"}
Answer 0 (score: 1)
You can use the regexp_extract function to pull just the timestamp out of the string, as shown below:
df.withColumn("dateTime",
regexp_extract($"value", """\d{4}-[01]\d-[0-3]\d [0-2]\d:[0-5]\d:[0-5]\d""", 0)
).show(false)
Output:
+-------------------+
|dateTime           |
+-------------------+
|2019-05-14 04:02:03|
+-------------------+
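To see why this pattern picks up the second timestamp rather than the ISO one at the start of the line, here is a minimal sketch in plain Scala (no Spark needed) that runs the same regex against a shortened copy of the sample value. The names TimestampCheck, extractTimestamp and parseTimestamp are hypothetical and used only for illustration; parsing into a LocalDateTime is an extra step that is not part of the answer above.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical helper object, for illustration only.
object TimestampCheck {
  // Same pattern as passed to regexp_extract above. It requires a space
  // between date and time, so "2019-05-14T09:02:06.486Z" does not match.
  private val TimestampPattern =
    """\d{4}-[01]\d-[0-3]\d [0-2]\d:[0-5]\d:[0-5]\d""".r

  private val Formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Returns the first substring matching the pattern, if any.
  def extractTimestamp(value: String): Option[String] =
    TimestampPattern.findFirstIn(value)

  // Parses the extracted text into a LocalDateTime. The regex is looser than
  // a real calendar check, so this can throw DateTimeParseException on
  // strings such as "2019-19-39 29:00:00".
  def parseTimestamp(value: String): Option[LocalDateTime] =
    extractTimestamp(value).map(s => LocalDateTime.parse(s, Formatter))

  def main(args: Array[String]): Unit = {
    // Shortened copy of the sample value from the question.
    val sample =
      "2019-05-14T09:02:06.486Z index:: host:: 2019-05-14 04:02:03,307 INFO - " +
        "\tTue May 14 04:02:03 CDT 2019\tID:<490744.1557824523305.0>"
    println(extractTimestamp(sample))
    println(parseTimestamp(sample))
  }
}
```

In Spark itself, the extracted string column could probably be turned into a real timestamp column with to_timestamp and the format "yyyy-MM-dd HH:mm:ss", assuming Spark 2.2 or later.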