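I have a column containing string values. I want to replace substrings in that column with values from other columns, and replace every plus sign with a space (as shown below).
I have these List[String] mappings, which are passed in dynamically; mapFrom and mapTo are expected to correspond to each other by index.
Description values: mapFrom: ["Child Name", "Child Age", "Child State"]
Column names: mapTo: ["name", "age", "state"]
Input example: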
name, age, state, description
tiffany, 10, virginia, Child Name + Child Age + Child State
andrew, 11, california, Child State + Child Name + Child Age
tyler, 12, ohio, Child Age + Child State + Child Name
Expected result:
name, age, state, description
tiffany, 10, virginia, tiffany 10 virginia
andrew, 11, california, california andrew 11
tyler, 12, ohio, 12 ohio tyler
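How can I do this with Spark Scala?
Answer 0 (score: 1)
You can use regexp_replace to replace a substring with the value from another column.
First, zip the two lists (here I also append the plus-sign-to-space replacement to both lists, but that could be done separately):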
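import org.apache.spark.sql.functions.{col, lit, regexp_replace}
// regexp_replace takes a regex pattern, so the literal plus sign is escaped as \\+
val mapFrom = List("Child Name", "Child Age", "Child State") :+ " \\+ "
val mapTo = List("name", "age", "state").map(col) :+ lit(" ")
val mapToFrom = mapFrom.zip(mapTo)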
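One detail worth checking at this step: regexp_replace interprets its pattern as a Java regular expression, so a bare + acts as a quantifier instead of matching a plus sign. The plain-Scala sketch below needs no Spark; it uses String.replaceAll, which follows the same regex rules, and the object name PlusEscaping and the value raw are purely illustrative.

```scala
import java.util.regex.Pattern

object PlusEscaping {
  // Illustrative value: a description after the column values were substituted.
  val raw = "tiffany + 10 + virginia"

  // Unescaped: " + " means "one or more spaces, then a space",
  // so it never matches the literal " + " separator.
  val unescaped: String = raw.replaceAll(" + ", " ")

  // Escaped by hand: "\\+" matches a literal plus sign.
  val escaped: String = raw.replaceAll(" \\+ ", " ")

  // Pattern.quote escapes an arbitrary literal, which is handy
  // when the strings to replace are supplied dynamically.
  val quoted: String = raw.replaceAll(Pattern.quote(" + "), " ")

  def main(args: Array[String]): Unit = {
    println(unescaped)
    println(escaped)
    println(quoted)
  }
}
```

If the entries of mapFrom can contain other regex metacharacters (parentheses, dots, and so on), wrapping each one in Pattern.quote before passing it to regexp_replace is one way to keep them literal.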
Assuming the input DataFrame is df, replace every substring with its corresponding value as follows:
val df2 = mapToFrom.foldLeft(df) { case (acc, (from, to)) =>
  acc.withColumn("description", regexp_replace(col("description"), lit(from), to))
}
With the provided input data, the result is as expected:
+-------+---+----------+--------------------+
|name   |age|state     |description         |
+-------+---+----------+--------------------+
|tiffany|10 |virginia  |tiffany 10 virginia |
|andrew |11 |california|california andrew 11|
|tyler  |12 |ohio      |12 ohio tyler       |
+-------+---+----------+--------------------+
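For anyone who wants to try this end to end, here is a self-contained sketch of the same idea. It assumes Spark 2.1 or later (for the regexp_replace overload that takes Column arguments) and a local SparkSession. The object name ReplaceFromColumns and the helper fillDescription are hypothetical names introduced for illustration, and the use of Pattern.quote on the mapFrom entries is an added precaution rather than something the original answer does.

```scala
import java.util.regex.Pattern

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, regexp_replace}

object ReplaceFromColumns {
  // Hypothetical helper: replaces each literal in mapFrom with the value of the
  // column at the same index in mapTo, then turns " + " separators into spaces.
  def fillDescription(df: DataFrame, mapFrom: List[String], mapTo: List[String]): DataFrame = {
    require(mapFrom.length == mapTo.length, "mapFrom and mapTo must have the same length")

    val substituted = mapFrom.zip(mapTo).foldLeft(df) { case (acc, (from, to)) =>
      acc.withColumn(
        "description",
        // Pattern.quote keeps the search string literal; the cast makes the
        // replacement a string even for numeric columns such as age.
        regexp_replace(col("description"), lit(Pattern.quote(from)), col(to).cast("string"))
      )
    }

    // The plus sign is a regex quantifier, so it is escaped here.
    substituted.withColumn("description", regexp_replace(col("description"), " \\+ ", " "))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("replace-from-columns")
      .getOrCreate()
    import spark.implicits._

    // Same rows as the input example in the question.
    val df = Seq(
      ("tiffany", 10, "virginia", "Child Name + Child Age + Child State"),
      ("andrew", 11, "california", "Child State + Child Name + Child Age"),
      ("tyler", 12, "ohio", "Child Age + Child State + Child Name")
    ).toDF("name", "age", "state", "description")

    val mapFrom = List("Child Name", "Child Age", "Child State")
    val mapTo = List("name", "age", "state")

    fillDescription(df, mapFrom, mapTo).show(false)
    spark.stop()
  }
}
```

Handling the separator in its own step, outside the zipped lists, keeps mapFrom and mapTo limited to the dynamically supplied mappings.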