How to replace string values in one column with actual column values from other columns in the same dataframe? Part 2

Time: 2019-08-05 13:26:01

Tags: scala dataframe apache-spark

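I have a column of string values, and I want to replace substrings in that column with values from other columns, and replace every plus sign with a space (as shown below).

I have these List[String] mappings, which are passed in dynamically; mapFrom and mapTo correspond to each other by index.

Description values: mapFrom: ["Child", "ChildAge", "ChildState"]

Column names: mapTo: ["name", "age", "state"]

Example input: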

name, age, state, description
tiffany, 10, virginia, Child + ChildAge + ChildState
andrew, 11, california, ChildState + Child + ChildAge
tyler, 12, ohio, ChildAge + ChildState + Child

Expected result:

name, age, state, description
tiffany, 10, virginia, tiffany 10 virginia
andrew, 11, california, california andrew 11
tyler, 12, ohio, 12 ohio tyler

How can I do this with Spark Scala?

When I try the solution from here: How to replace string values in one column with actual column values from other columns in the same dataframe?

the output becomes

name, age, state, description
tiffany, 10, virginia, tiffany tiffanyAge tiffanyState
andrew, 11, california, andrewState andrew andrewAge
tyler, 12, ohio, tylerAge tylerState tyler

2 Answers:

Answer 0: (score: 1)

I would use map rather than the built-in Spark functions.
It is not the cleanest solution, but it works.

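// Requires a SparkSession named spark in scope (as in spark-shell).
import spark.implicits._

// Note: the placeholder here is "ChildName", not "Child" as in the question.
val data = Seq(
  ("tiffany", 10, "virginia", "ChildName + ChildAge + ChildState"),
  ("andrew", 11, "california", "ChildState + ChildName + ChildAge"),
  ("tyler", 12, "ohio", "ChildAge + ChildState + ChildName")
).toDF("name", "age", "state", "description")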

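Define the schema used to build the encoder

import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType),
  StructField("state", StringType),
  StructField("description", StringType)
))
// Encoder for the Row values returned from map below.
val encoder = RowEncoder(schema)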

The logic itself

import org.apache.spark.sql.Row

val res = data.map(row => {
  // Split "A + B + C" into its placeholders, ignoring whitespace.
  val tokens = row.getAs[String]("description").replaceAll("\\s+", "").split("\\+")
  val sb = new StringBuilder()

  // Append the matching column value for each placeholder, in order,
  // each followed by a space. An unknown placeholder throws a MatchError.
  tokens.foreach {
    case "ChildState" => sb.append(row.getAs[String]("state")).append(" ")
    case "ChildAge" => sb.append(row.getAs[Int]("age")).append(" ")
    case "ChildName" => sb.append(row.getAs[String]("name")).append(" ")
  }

  Row(row.getAs[String]("name"), row.getAs[Int]("age"), row.getAs[String]("state"), sb.toString())
}) (encoder)

Result

res.show(false)
+-------+---+----------+---------------------+
|name   |age|state     |description          | 
+-------+---+----------+---------------------+
|tiffany|10 |virginia  |tiffany 10 virginia  |
|andrew |11 |california|california andrew 11 |
|tyler  |12 |ohio      |12 ohio tyler        |
+-------+---+----------+---------------------+

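Answer 1: (score: 1)

The problem here is that the description contains Child, which is a prefix of both ChildAge and ChildState. Because the replacement is done with regular expressions, the Child part of those longer placeholders is also replaced with the name, which gives odd output such as tiffanyAge and tiffanyState (note that only the Child part has been replaced by the name).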
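
The effect can be reproduced without Spark. The sketch below is a plain-Scala stand-in, not the code from the linked answer: it applies the same replacements in list order to a single description with `String.replaceAll`. The object name `PrefixClash` and the hard-coded values for the first sample row are assumptions made for illustration.

```scala
object PrefixClash {
  // Placeholders from the question, plus the " + " separator.
  val mapFrom = List("Child", "ChildAge", "ChildState") :+ " \\+ "
  // Hard-coded values for the first sample row (illustrative assumption).
  val values  = List("tiffany", "10", "virginia") :+ " "

  // Apply each replacement in list order, as a foldLeft over withColumn would.
  def substitute(description: String): String =
    mapFrom.zip(values).foldLeft(description) { case (acc, (from, to)) =>
      acc.replaceAll(from, to)
    }

  def main(args: Array[String]): Unit = {
    // "Child" is replaced first, so "ChildAge" has already become "tiffanyAge"
    // by the time the "ChildAge" pattern is tried.
    println(substitute("Child + ChildAge + ChildState"))
    // prints: tiffany tiffanyAge tiffanyState
  }
}
```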

In this case there are two simple solutions that do not require changing the input:

  1. Change the regex for Child to use a lookahead:

    val mapFrom = List("Child(?= )", "ChildAge", "ChildState") :+ " \\+ "
    

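    It then matches Child only when it is followed by a space. Note that a Child at the very end of the description, with nothing after it, is therefore not matched.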

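  2. Put Child last in the list, so that ChildAge and ChildState are matched first (the entries of mapTo must be reordered the same way so the pairs still line up):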

    val mapFrom = List("ChildAge", "ChildState", "Child") :+ " \\+ "
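
Both options can be checked against a sample description with plain Scala regexes before running them in Spark. The sketch below uses the third sample row, where Child is the last token, which is the case worth checking for the lookahead. The object name `OptionCheck`, the hard-coded row values, and the word-boundary pattern `Child\\b` are assumptions for illustration; the word boundary is one possible alternative and is not part of the original answer.

```scala
object OptionCheck {
  // Illustrative values for the third sample row (tyler, 12, ohio).
  val byName = Map("name" -> "tyler", "age" -> "12", "state" -> "ohio")

  // Apply (pattern -> column) pairs in order, then turn " + " into a space.
  def substitute(pairs: List[(String, String)], description: String): String =
    (pairs.map { case (from, column) => from -> byName(column) } :+ (" \\+ " -> " "))
      .foldLeft(description) { case (acc, (from, to)) => acc.replaceAll(from, to) }

  // Option 1 as written: Child must be followed by a space.
  val lookahead    = List("Child(?= )" -> "name", "ChildAge" -> "age", "ChildState" -> "state")
  // Hypothetical variant: Child must end at a word boundary.
  val wordBoundary = List("Child\\b" -> "name", "ChildAge" -> "age", "ChildState" -> "state")
  // Option 2: longer placeholders first, with the column names reordered to match.
  val childLast    = List("ChildAge" -> "age", "ChildState" -> "state", "Child" -> "name")

  def main(args: Array[String]): Unit = {
    val description = "ChildAge + ChildState + Child"
    println(substitute(lookahead, description))    // 12 ohio Child
    println(substitute(wordBoundary, description)) // 12 ohio tyler
    println(substitute(childLast, description))    // 12 ohio tyler
  }
}
```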
    

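Full solution using the first option:

import org.apache.spark.sql.functions.{col, lit, regexp_replace}
import spark.implicits._ // for the $"..." column syntax

// df is the input dataframe from the question.
val mapFrom = List("Child(?= )", "ChildAge", "ChildState") :+ " \\+ "
val mapTo = List("name", "age", "state").map(col) :+ lit(" ")
val mapToFrom = mapFrom.zip(mapTo)

// Apply each (pattern, replacement column) pair to description in turn.
val df2 = mapToFrom.foldLeft(df){case (df, (from, to)) => 
  df.withColumn("description", regexp_replace($"description", lit(from), to))
}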
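
Because the lists are passed in dynamically, hard-coding the order or a lookahead may not always be possible. One possible generalisation of the second option, sketched here as an assumption rather than as part of the original answer, is to sort the pairs by descending placeholder length so that a shorter placeholder is never tried before a longer one that contains it. `ReplacementOrder` and `ordered` are hypothetical names, and the placeholders are assumed to be plain literals rather than regexes.

```scala
object ReplacementOrder {
  // Pair each placeholder with its column name and try longer placeholders first,
  // so "Child" can no longer consume the start of "ChildAge" or "ChildState".
  def ordered(mapFrom: List[String], mapTo: List[String]): List[(String, String)] =
    mapFrom.zip(mapTo).sortBy { case (from, _) => -from.length }

  def main(args: Array[String]): Unit = {
    val pairs = ordered(
      List("Child", "ChildAge", "ChildState"),
      List("name", "age", "state")
    )
    pairs.foreach(println)
    // prints:
    // (ChildState,state)
    // (ChildAge,age)
    // (Child,name)
  }
}
```

In the Spark code, these pairs would take the place of mapToFrom, with each column name passed through col and the " \\+ " to lit(" ") pair appended last.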