I have a DataFrame rating in the following format:
id | percentile
------------+-------------------------------------------------
0011111111 | {"80": 3438, "40": 1063, "60": 2119, "20": 620}
I need to transform it into the following format:
id | 80 | 40 | 60 | 20 |
------------+------+------+------+------+
0011111111 | 3438 | 1063 | 2119 | 620 |
I tried the following code, but it did not help:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val schema = StructType(Seq(
  StructField("80", DoubleType, true),
  StructField("60", DoubleType, true),
  StructField("40", DoubleType, true),
  StructField("20", DoubleType, true)
))
val rating1 = rating.withColumn("jsonData", from_json(col("percentile"), schema))
rating1.show()
+--------------------+--------------------+--------------------+
|                 cid|          percentile|            jsonData|
+--------------------+--------------------+--------------------+
|          0011111111|{"80": 3438, "40"...|[3438.0, 1063.0, ...|
+--------------------+--------------------+--------------------+
How do I get 80, 60, 40 and 20 as columns?
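For reference, jsonData in the attempt above is already a struct column, so its fields can be expanded with a star selection. A minimal sketch building on rating1 (it assumes the id column is called id as in the first table; the show() output suggests it may actually be cid):
// Expand every field of the struct produced by from_json into its own column
val flattened = rating1.select("id", "jsonData.*")
flattened.show()
// Or pick individual fields explicitly:
// rating1.select(col("id"), col("jsonData.80").as("80"), col("jsonData.60").as("60")).show()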
Answer 0 (score: 0)
scala> df.show(false)
+--------+-----------------------------------------------+
|id |percentile |
+--------+-----------------------------------------------+
|11111111|{"80": 3438, "40": 1063, "60": 2119, "20": 620}|
+--------+-----------------------------------------------+
//UDF to replace '{' and '}' from column percentile
scala> import org.apache.spark.sql.functions._
scala> import org.apache.spark.sql.expressions.UserDefinedFunction
scala> val replace = udf((data: String , rep : String)=>data.replaceAll(rep, ""))
scala> val df1 = df.withColumn("percentile", replace($"percentile", lit("\\{"))).withColumn("percentile", replace($"percentile", lit("\\}")))
scala> df1.show(false)
+--------+---------------------------------------------+
|id |percentile |
+--------+---------------------------------------------+
|11111111|"80": 3438, "40": 1063, "60": 2119, "20": 620|
+--------+---------------------------------------------+
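As an aside, the same brace stripping can be done without a UDF by using the built-in regexp_replace function; a minimal sketch on the same df (an alternative, not what this answer uses):
//Remove '{' and '}' in one pass with a character class instead of a UDF
scala> import org.apache.spark.sql.functions.regexp_replace
scala> val df1Alt = df.withColumn("percentile", regexp_replace($"percentile", "[{}]", ""))
scala> df1Alt.show(false)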
//Mapping header with its value from column percentile
scala> val df2 = df1.withColumn("var", explode(split(col("percentile"), ", "))).withColumn("header", split(col("var"), ": ")(0)).withColumn("value", split(col("var"), ": ")(1)).drop("var")
scala> df2.show(false)
+--------+---------------------------------------------+------+-----+
|id |percentile |header|value|
+--------+---------------------------------------------+------+-----+
|11111111|"80": 3438, "40": 1063, "60": 2119, "20": 620|"80" |3438 |
|11111111|"80": 3438, "40": 1063, "60": 2119, "20": 620|"40" |1063 |
|11111111|"80": 3438, "40": 1063, "60": 2119, "20": 620|"60" |2119 |
|11111111|"80": 3438, "40": 1063, "60": 2119, "20": 620|"20" |620 |
+--------+---------------------------------------------+------+-----+
//Final expected output
scala> df2.groupBy("id").pivot("header").agg(concat_ws("",collect_list(col("value")))).show()
+--------+----+----+----+----+
| id|"20"|"40"|"60"|"80"|
+--------+----+----+----+----+
|11111111| 620|1063|2119|3438|
+--------+----+----+----+----+
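Note that the pivoted column names still carry the embedded double quotes ("20", "40", ...), whereas the desired output has plain 80/40/60/20 headers. A small follow-up sketch that strips the quotes before pivoting, reusing df2 from above:
//Strip the surrounding quotes from header so the final column names are 80/40/60/20
scala> import org.apache.spark.sql.functions.regexp_replace
scala> val df3 = df2.withColumn("header", regexp_replace($"header", "\"", ""))
scala> df3.groupBy("id").pivot("header").agg(concat_ws("", collect_list(col("value")))).show()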
Answer 1 (score: 0)
You can achieve this with the from_json function. I believe there is also no need to select each element by hand ('map.getField("80"), 'map.getField("40"), ...): they can be passed as an Array[Column] instead; a sketch of that variant follows the output below.
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.from_json
import spark.implicits._   // available in spark-shell; otherwise use your SparkSession's implicits

val str = new StructType()
  .add("80", "string")
  .add("40", "string")
  .add("60", "string")
  .add("20", "string")

// input DataFrame and its schema, for reference
df.show(false)
df.printSchema

df.select('id, from_json('percentile, str).as("map"))
  .select('id,
    'map.getField("80"),
    'map.getField("40"),
    'map.getField("60"),
    'map.getField("20")
  ).show()
+----------+-----------------------------------------------+
|id |percentile |
+----------+-----------------------------------------------+
|0011111111|{"80": 3438, "40": 1063, "60": 2119, "20": 620}|
+----------+-----------------------------------------------+
root
|-- id: string (nullable = true)
|-- percentile: string (nullable = true)
+----------+------+------+------+------+
| id|map.80|map.40|map.60|map.20|
+----------+------+------+------+------+
|0011111111| 3438| 1063| 2119| 620|
+----------+------+------+------+------+
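The Array[Column] idea mentioned above could look roughly like this; a minimal sketch against the same df and str schema (the keys list is introduced here purely for illustration):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, from_json}

// Build the field selections programmatically instead of writing out each getField call
val keys = Seq("80", "40", "60", "20")
val fieldCols: Array[Column] = keys.map(k => col(s"map.$k").as(k)).toArray

df.select(col("id"), from_json(col("percentile"), str).as("map"))
  .select(col("id") +: fieldCols: _*)
  .show()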