如何使用带有scala的spark 2.0,如何从字符串中排除所有字母,而仅在单独的列中保留数字值。
输入
"ActivalteTime": "PT5M",
"ReActivalteTime": "xy20$",
输出
"NewActivalteTime": "5",
"NewReActivalteTime": "20",
请帮助
答案 0 :(得分:0)
使用 Regexp_extract
函数从字符串中仅提取数字。
val df=Seq((""""ActivalteTime": "PT5M","""),(""""ReActivalteTime": "xy20$",""")).toDF("text")
df.show(false)
结果:
+---------------------------+
|text |
+---------------------------+
|"ActivalteTime": "PT5M", |
|"ReActivalteTime": "xy20$",|
+---------------------------+
使用Regexp_extract
:
df.withColumn("num",regexp_extract($"text","(\\d+)",1)).show(false)
+---------------------------+---+
|text |num|
+---------------------------+---+
|"ActivalteTime": "PT5M", |5 |
|"ReActivalteTime": "xy20$",|20 |
+---------------------------+---+
答案 1 :(得分:0)
这是一种稍微通用的方法,用于处理要使用regexp_extract
提取数字内容的任意列列表:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1, "A", "PT5M", "xy20$", "M100.1!"),
(2, "B", "QU6N", "uv%", "N200.2&")
).toDF("C1", "C2", "C3", "C4", "C5")
val colsToExtract = Seq("C3", "C4", "C5")
val colsRemained = df.columns diff colsToExtract
val prefix = "New"
df.select(colsRemained.map(col) ++ colsToExtract.map(c =>
regexp_extract(col(c), "([0-9.]+)", 1).as(s"${prefix}$c")): _*
).show
// +---+---+-----+-----+-----+
// | C1| C2|NewC3|NewC4|NewC5|
// +---+---+-----+-----+-----+
// | 1| A| 5| 20|100.1|
// | 2| B| 6| |200.2|
// +---+---+-----+-----+-----+