How to extract a specific string from a DataFrame value column

Time: 2019-05-15 06:57:51

Tags: scala apache-spark

I need to extract a timestamp from the value column.

I tried using getItem, but it does not return anything:

import org.apache.spark.sql.functions._  // split, regexp_extract, concat, when, lit, col
// The $"..." column syntax also needs import spark.implicits._ (for a SparkSession named spark).

val data = df.withColumn("splitted", split($"value", "/"))
      .select($"splitted".getItem(6).alias("region"), $"splitted".getItem(7).alias("service"), col("value"))
      .withColumn("service_type", regexp_extract($"service", """.*(Inbound|Outbound|Outound).*""", 1))
      .withColumn("region_type", concat(
        when(col("region").isNotNull, col("region")).otherwise(lit("null")), lit(" "),
        when(col("service").isNotNull, col("service_type")).otherwise(lit("null"))))
      // Split the raw line on tab characters and keep the second piece.
      .withColumn("splitt", split($"value", "\t"))
      .select($"splitt".getItem(1).alias("datetime"))

I need to extract the timestamp 2019-05-14 04:02:03 from the string below into a new column "datetime":

{"value":"2019-05-14T09:02:06.486Z index:: host:: 2019-05-14 04:02:03,307 INFO  - \tTue May 14 04:02:03 CDT 2019\tID:<490744.1557824523305.0>\tsv\tAFTER_LOOKUP_QUERY_PARTNER_CHANNEL\t[messageData(DispatchID: 06708235871 Region: EMEA SubRegion: EU OperationType: <OperationType>STATUSUPDATE</OperationType> Operation: StatusUpdate)]\tms \t"}

1 Answer:

Answer 0: (score: 1)

You can use the regexp_extract function to extract just the timestamp from the string, as shown below:

df.withColumn("dateTime", 
      regexp_extract($"value", """\d{4}-[01]\d-[0-3]\d [0-2]\d:[0-5]\d:[0-5]\d""", 0)
).show(false)

Output:

+-------------------+
|dateTime           |
+-------------------+
|2019-05-14 04:02:03|
+-------------------+
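
To check the pattern without starting Spark, the same regular expression can be tried against the sample line in plain Scala. This is only a sketch: the object name TimestampRegexCheck and the helper extractTimestamp are made up for illustration, the sample is a shortened copy of the value from the question, and it assumes the \t sequences in that value are real tab characters.

```scala
import scala.util.matching.Regex

object TimestampRegexCheck {
  // Same pattern as the one passed to regexp_extract in the answer.
  val TimestampPattern: Regex =
    """\d{4}-[01]\d-[0-3]\d [0-2]\d:[0-5]\d:[0-5]\d""".r

  // Shortened copy of the sample value from the question (assumption: real tabs).
  val sample: String =
    "2019-05-14T09:02:06.486Z index:: host:: 2019-05-14 04:02:03,307 INFO  - " +
      "\tTue May 14 04:02:03 CDT 2019\tID:<490744.1557824523305.0>\tsv"

  // Hypothetical helper: first match of the pattern, or None if there is none.
  def extractTimestamp(line: String): Option[String] =
    TimestampPattern.findFirstIn(line)

  def main(args: Array[String]): Unit = {
    // The leading ISO timestamp uses 'T' instead of a space, so it is skipped.
    println(extractTimestamp(sample))
    // Element 1 of a tab split holds the "Tue May 14 ..." form instead.
    println(sample.split("\t")(1))
  }
}
```

Under that assumption, the wanted timestamp sits inside the first tab-separated piece (index 0) rather than at index 1, which is why a pattern match is a better fit here than split plus getItem. Note that findFirstIn returns None when nothing matches, whereas regexp_extract returns an empty string in that case.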