Question

I have a DataFrame with the following schema:

root
 |-- id: string (nullable = true)
 |-- scoreMap: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- scores: struct (nullable = true)
 |    |    |    |-- SCORE1: double (nullable = true)
 |    |    |    |-- SCORE2: double (nullable = true)
 |    |    |    |-- SCORE3: double (nullable = true)
 |    |    |-- combinedScore: double (nullable = true)

Sample data:

id   scoreMap
id1   Map(key1 -> [[1.0, 3.2, 2.22], 2.42],   key2 -> [[3.0, 3.2, 1.2], 4.42])
id2   Map(key3 -> [[1.0, 3.2, 2.22], 3.1],   key3 -> [[3.0, 3.2, 1.2], 2.42])

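I want to 1) convert the scoreMap column into a list, 2) sort the list by combinedScore in descending order, and 3) add each element's index in the sorted list to the element. For the given example, the result should be: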

id   scoreList
id1   List([key2, [3.0, 3.2, 1.2], 4.42, 0],   [key1, [1.0, 3.2, 2.22], 2.42, 1])
id2   List([key3, [1.0, 3.2, 2.22], 3.1, 0],   [key3, [3.0, 3.2, 1.2], 2.42, 1])

How can I do this?

Answer 1

You can do something like this:

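import sqlContext.implicits._
import org.apache.spark.sql.functions.udf
// Scores is assumed to be a case class mirroring the map's value struct (it must expose combinedScore: Double)
// Each list element is a (key, Scores) pair, so sort on the pair's value; negating gives descending order
val mapToSortedList: Map[String,Scores] => List[(String,Scores)] = _.toList.sortBy { case (_, scores) => -scores.combinedScore }
val mapToListUDF = udf(mapToSortedList)
// Replace the scoreMap column with the sorted list of (key, value) pairs
val newDF = dF.withColumn("scoreMap",mapToListUDF('scoreMap))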

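My answer does not cover the part about adding the index. I don't know how to do that without writing more complicated code (creating a new List type with a custom ordering and adding the sort index to each element).

I hope this can at least serve as a starting point.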
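
For the missing index part, here is one possible sketch using plain Scala collections rather than Spark itself: sort the entries by combinedScore in descending order, then let zipWithIndex attach each element's position. Scores is used the same way as in the snippet above (the map's value type); SubScores, RankedScore and mapToRankedList are hypothetical names introduced only for illustration, and the case classes are assumed to mirror the schema in the question.

```scala
// Hypothetical case classes, assumed to mirror the schema from the question
case class SubScores(SCORE1: Double, SCORE2: Double, SCORE3: Double)
case class Scores(scores: SubScores, combinedScore: Double)
// Hypothetical output element: key, value fields, and position in the sorted list
case class RankedScore(key: String, scores: SubScores, combinedScore: Double, index: Int)

// Convert the map to a list, sort by combinedScore (descending), then attach the index
def mapToRankedList(scoreMap: Map[String, Scores]): List[RankedScore] = {
  scoreMap.toList
    .sortBy { case (_, value) => -value.combinedScore }
    .zipWithIndex
    .map { case ((key, value), idx) =>
      RankedScore(key, value.scores, value.combinedScore, idx)
    }
}

// The id1 row from the sample data
val id1 = Map(
  "key1" -> Scores(SubScores(1.0, 3.2, 2.22), 2.42),
  "key2" -> Scores(SubScores(3.0, 3.2, 1.2), 4.42)
)
val ranked = mapToRankedList(id1)
ranked.foreach(println)
```

For the id1 row this puts key2 first with index 0 and key1 second with index 1. To use it on the DataFrame the function would still have to be wrapped in a UDF, and how the struct values arrive inside a UDF (for example as Row objects rather than case classes) depends on the Spark version, so that wiring is left as an assumption here.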

Spark DataFrame: convert a map with StructType values into a sorted list
