I have a DataFrame rating in the following format:
id | percentile
------------+-------------------------------------------------
0011111111 | {"80": 3438, "40": 1063, "60": 2119, "20": 620}
I need to transform it into the following format:
id | 80 | 40 | 60 | 20 |
------------+------+------+------+------+
0011111111 | 3438 | 1063 | 2119 | 620 |
I tried the following code, but it did not help:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val schema = StructType(Seq(
  StructField("80", DoubleType, true),
  StructField("60", DoubleType, true),
  StructField("40", DoubleType, true),
  StructField("20", DoubleType, true)
))
val rating1 = rating.withColumn("jsonData", from_json(col("percentile"), schema))
rating1.show()
+--------------------+--------------------+--------------------+
|                 cid|          percentile|            jsonData|
+--------------------+--------------------+--------------------+
|          0011111111|{"80": 3438, "40"...|[3438.0, 1063.0, ...|
+--------------------+--------------------+--------------------+
How do I get 80, 60, 40 and 20 as columns?
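For reference, jsonData in the attempt above is already a struct column, so its fields can be expanded with a star selection. A minimal sketch building on rating1 (it assumes the id column is called id as in the first table; the show() output suggests it may actually be cid):
// Expand every field of the struct produced by from_json into its own column
val flattened = rating1.select("id", "jsonData.*")
flattened.show()
// Or pick individual fields explicitly:
// rating1.select(col("id"), col("jsonData.80").as("80"), col("jsonData.60").as("60")).show()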
Answer 0 (score: 0)
scala> df.show(false)
+--------+-----------------------------------------------+
|id |percentile |
+--------+-----------------------------------------------+
|11111111|{"80": 3438, "40": 1063, "60": 2119, "20": 620}|
+--------+-----------------------------------------------+
//UDF to replace '{' and '}' from column percentile
scala> import org.apache.spark.sql.functions._
scala> import org.apache.spark.sql.expressions.UserDefinedFunction
scala> val replace = udf((data: String , rep : String)=>data.replaceAll(rep, ""))
scala> val df1 = df.withColumn("percentile", replace($"percentile", lit("\\{"))).withColumn("percentile", replace($"percentile", lit("\\}")))
scala> df1.show(false)
+--------+---------------------------------------------+
|id |percentile |
+--------+---------------------------------------------+
|11111111|"80": 3438, "40": 1063, "60": 2119, "20": 620|
+--------+---------------------------------------------+
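As an aside, the same brace stripping can be done without a UDF by using the built-in regexp_replace function; a minimal sketch on the same df (an alternative, not what this answer uses):
//Remove '{' and '}' in one pass with a character class instead of a UDF
scala> import org.apache.spark.sql.functions.regexp_replace
scala> val df1Alt = df.withColumn("percentile", regexp_replace($"percentile", "[{}]", ""))
scala> df1Alt.show(false)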
//Mapping header with its value from column percentile
scala> val df2 = df1.withColumn("var", explode(split(col("percentile"), ", "))).withColumn("header", split(col("var"), ": ")(0)).withColumn("value", split(col("var"), ": ")(1)).drop("var")
scala> df2.show(false)
+--------+---------------------------------------------+------+-----+
|id |percentile |header|value|
+--------+---------------------------------------------+------+-----+
|11111111|"80": 3438, "40": 1063, "60": 2119, "20": 620|"80" |3438 |
|11111111|"80": 3438, "40": 1063, "60": 2119, "20": 620|"40" |1063 |
|11111111|"80": 3438, "40": 1063, "60": 2119, "20": 620|"60" |2119 |
|11111111|"80": 3438, "40": 1063, "60": 2119, "20": 620|"20" |620 |
+--------+---------------------------------------------+------+-----+
//Final expected output
scala> df2.groupBy("id").pivot("header").agg(concat_ws("",collect_list(col("value")))).show()
+--------+----+----+----+----+
| id|"20"|"40"|"60"|"80"|
+--------+----+----+----+----+
|11111111| 620|1063|2119|3438|
+--------+----+----+----+----+
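Note that the pivoted column names still carry the embedded double quotes ("20", "40", ...), whereas the desired output has plain 80/40/60/20 headers. A small follow-up sketch that strips the quotes before pivoting, reusing df2 from above:
//Strip the surrounding quotes from header so the final column names are 80/40/60/20
scala> import org.apache.spark.sql.functions.regexp_replace
scala> val df3 = df2.withColumn("header", regexp_replace($"header", "\"", ""))
scala> df3.groupBy("id").pivot("header").agg(concat_ws("", collect_list(col("value")))).show()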
Answer 1 (score: 0)
You can achieve this with the from_json function. I believe there is also no need to select each element by hand ('map.getField("80"), 'map.getField("40"), ...): they can be passed as an Array[Column] instead; a sketch of that variant follows the output below.
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.from_json
import spark.implicits._   // available in spark-shell; otherwise use your SparkSession's implicits

val str = new StructType()
  .add("80", "string")
  .add("40", "string")
  .add("60", "string")
  .add("20", "string")

// input DataFrame and its schema, for reference
df.show(false)
df.printSchema

df.select('id, from_json('percentile, str).as("map"))
  .select('id,
    'map.getField("80"),
    'map.getField("40"),
    'map.getField("60"),
    'map.getField("20")
  ).show()
+----------+-----------------------------------------------+
|id |percentile |
+----------+-----------------------------------------------+
|0011111111|{"80": 3438, "40": 1063, "60": 2119, "20": 620}|
+----------+-----------------------------------------------+
root
|-- id: string (nullable = true)
|-- percentile: string (nullable = true)
+----------+------+------+------+------+
| id|map.80|map.40|map.60|map.20|
+----------+------+------+------+------+
|0011111111| 3438| 1063| 2119| 620|
+----------+------+------+------+------+
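The Array[Column] idea mentioned above could look roughly like this; a minimal sketch against the same df and str schema (the keys list is introduced here purely for illustration):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, from_json}

// Build the field selections programmatically instead of writing out each getField call
val keys = Seq("80", "40", "60", "20")
val fieldCols: Array[Column] = keys.map(k => col(s"map.$k").as(k)).toArray

df.select(col("id"), from_json(col("percentile"), str).as("map"))
  .select(col("id") +: fieldCols: _*)
  .show()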