Exclude letters and special characters from an alphanumeric string in Spark Scala

Time: 2020-01-15 02:59:23

Tags: scala apache-spark

Using Spark 2.0 with Scala, how can I strip all letters and special characters from a string and keep only the numeric value in a separate column?

Input

  "ActivalteTime": "PT5M", 
  "ReActivalteTime": "xy20$", 

Output

  "NewActivalteTime": "5", 
  "NewReActivalteTime": "20", 

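Please help.

2 answers:

Answer 0 (score: 0)

Use the regexp_extract function to pull only the digits out of the string. It returns the requested capture group of the first match, so the pattern (\\d+) yields the first run of digits in each value.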

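import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._  // needed for toDF and the $"..." column syntax outside spark-shell

val df = Seq((""""ActivalteTime": "PT5M","""), (""""ReActivalteTime": "xy20$",""")).toDF("text")

df.show(false)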

Result:

+---------------------------+
|text                       |
+---------------------------+
|"ActivalteTime": "PT5M",   |
|"ReActivalteTime": "xy20$",|
+---------------------------+

Using regexp_extract:

df.withColumn("num",regexp_extract($"text","(\\d+)",1)).show(false)

+---------------------------+---+
|text                       |num|
+---------------------------+---+
|"ActivalteTime": "PT5M",   |5  |
|"ReActivalteTime": "xy20$",|20 |
+---------------------------+---+
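
Because regexp_extract stops at the first match, a value with several separate digit runs would lose everything after the first one. If the goal is to drop every non-digit character instead, regexp_replace may be a closer fit. The following is a minimal sketch, not part of the original answer: the column names `raw`, `firstRun` and `allDigits` and the sample value "ab12cd34" are made up for illustration, and the local SparkSession is only there so the snippet can run as a standalone script (spark-shell already provides `spark`).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_extract, regexp_replace}

// Local session for illustration only; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[1]").appName("digits-only").getOrCreate()
import spark.implicits._

// "ab12cd34" is a hypothetical value added to show where the two functions differ.
val sample = Seq("PT5M", "xy20$", "ab12cd34").toDF("raw")

val digits = sample
  // Capture group 1 of the first match: the first run of digits only.
  .withColumn("firstRun", regexp_extract(col("raw"), "(\\d+)", 1))
  // Replace every character that is not 0-9 with the empty string.
  .withColumn("allDigits", regexp_replace(col("raw"), "[^0-9]", ""))

digits.show(false)
// firstRun:  5, 20, 12
// allDigits: 5, 20, 1234
```

For the two values in the question both approaches give the same result; they differ only when digits are split by other characters.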

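Answer 1 (score: 0)

Here is a slightly more general approach that applies regexp_extract to an arbitrary list of columns whose numeric content should be extracted: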

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "A", "PT5M", "xy20$", "M100.1!"),
  (2, "B", "QU6N", "uv%", "N200.2&")
).toDF("C1", "C2", "C3", "C4", "C5")

val colsToExtract = Seq("C3", "C4", "C5")
val colsRemained = df.columns diff colsToExtract

val prefix = "New"

df.select(colsRemained.map(col) ++ colsToExtract.map(c =>
    regexp_extract(col(c), "([0-9.]+)", 1).as(s"${prefix}$c")): _*
  ).show
// +---+---+-----+-----+-----+
// | C1| C2|NewC3|NewC4|NewC5|
// +---+---+-----+-----+-----+
// |  1|  A|    5|   20|100.1|
// |  2|  B|    6|     |200.2|
// +---+---+-----+-----+-----+
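
Two details of the output above may matter in practice: the pattern "([0-9.]+)" also matches a lone dot, and the extracted columns are still strings (an empty string where nothing matched). The sketch below is one possible refinement under stated assumptions, not part of the original answer: the stricter pattern, the column names `loose`, `strict` and `asDouble`, and the sample value "a.b7" are all illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_extract, when}

// Local session for illustration only; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[1]").appName("numeric-cast").getOrCreate()
import spark.implicits._

// "a.b7" is a hypothetical value: its first match for "[0-9.]+" is just ".".
val values = Seq("M100.1!", "uv%", "a.b7").toDF("C5")

// Digits with an optional fractional part; group 1 is the whole number.
val numberPattern = "([0-9]+(?:\\.[0-9]+)?)"

val parsed = values
  .withColumn("loose", regexp_extract(col("C5"), "([0-9.]+)", 1))
  .withColumn("strict", regexp_extract(col("C5"), numberPattern, 1))
  // Cast only non-empty results; rows without a number become null.
  .withColumn("asDouble", when(col("strict") =!= "", col("strict").cast("double")))

parsed.show(false)
// loose:    100.1, "", "."
// strict:   100.1, "", 7
// asDouble: 100.1, null, 7.0
```

Guarding the cast with `when` keeps rows without a number as null rather than relying on how the cast treats an empty string, which depends on the Spark version and its ANSI setting.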