Converting a Spark DataFrame map into an array of `{"Key": key, "Value": value}` maps

Date: 2019-10-07 16:52:50

Tags: apache-spark apache-spark-sql

How do you take a Spark DataFrame with a structure like this:

import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val sourcedf = spark.createDataFrame(
  List(
    Row(Map("AL" -> "Alabama", "AK" -> "Alaska").asJava),
    Row(Map("TX" -> "Texas", "FL" -> "Florida", "NJ" -> "New Jersey").asJava)
  ).asJava, StructType(
    StructField("my_map", MapType(StringType, StringType, false)) ::
    Nil))

or, in text form, what sourcedf.show(false) displays:

+----------------------------------------------+
|my_map                                        |
+----------------------------------------------+
|[AL -> Alabama, AK -> Alaska]                 |
|[TX -> Texas, FL -> Florida, NJ -> New Jersey]|
+----------------------------------------------+

and programmatically convert it to this structure:

val targetdf = spark.createDataFrame(
  List(
    Row(List(Map("Key" -> "AL", "Value" -> "Alabama"), Map("Key" -> "AK", "Value" -> "Alaska")).asJava),
    Row(List(Map("Key" -> "TX", "Value" -> "Texas"), Map("Key" -> "FL", "Value" -> "Florida"), Map("Key" -> "NJ", "Value" -> "New Jersey")).asJava)
  ).asJava, StructType(
    StructField("my_list", ArrayType(MapType(StringType, StringType, false), false)) ::
    Nil))

or, in text form, what targetdf.show(false) displays:

+----------------------------------------------------------------------------------------------+
|my_list                                                                                       |
+----------------------------------------------------------------------------------------------+
|[[Key -> AL, Value -> Alabama], [Key -> AK, Value -> Alaska]]                                 |
|[[Key -> TX, Value -> Texas], [Key -> FL, Value -> Florida], [Key -> NJ, Value -> New Jersey]]|
+----------------------------------------------------------------------------------------------+

1 Answer:

Answer 0 (score: 0)

So, using Scala, I couldn't figure out how to handle java.util.Map with the provided Encoders; I would probably have to write one myself, and I figured that was too much work.

However, I can see two ways to do this without converting to java.util.Map, using scala.collection.immutable.Map instead.

You can convert to a Dataset[Obj] and flatMap:

import spark.implicits._
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

case class Foo(my_map: Map[String, String])

case class Bar(my_list: List[Map[String, String]])

implicit val encoder = ExpressionEncoder[List[Map[String, String]]]

val ds: Dataset[Foo] = sourcedf.as[Foo]
val output: Dataset[Bar] = ds.map(x => Bar(x.my_map.flatMap({case (k, v) => List(Map("key" -> k, "value" -> v))}).toList))
output.show(false)
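The per-row reshaping inside the map above can be checked on plain Scala collections, with no Spark involved. A minimal sketch (note that since each pair yields exactly one element, a plain `map` would behave the same as `flatMap` here):

```scala
// The same key/value reshaping on an ordinary Scala Map, no Spark required.
val stateMap = Map("AL" -> "Alabama", "AK" -> "Alaska")

// Each (k, v) entry becomes a two-entry Map; flatMap over one-element
// lists is equivalent to a plain map in this case.
val reshaped: List[Map[String, String]] =
  stateMap.flatMap { case (k, v) => List(Map("key" -> k, "value" -> v)) }.toList
```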

Or you can use a UDF:

import spark.implicits._
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

val mapToList: Map[String, String] => List[Map[String, String]] = {
  x => x.flatMap({case (k, v) => List(Map("key" -> k, "value" -> v))}).toList
}
val mapToListUdf: UserDefinedFunction = udf(mapToList)
val output: Dataset[Row] = sourcedf.select(mapToListUdf($"my_map").as("my_list"))
output.show(false)

Both output:

+----------------------------------------------------------------------------------------------+
|my_list                                                                                       |
+----------------------------------------------------------------------------------------------+
|[[key -> AL, value -> Alabama], [key -> AK, value -> Alaska]]                                 |
|[[key -> TX, value -> Texas], [key -> FL, value -> Florida], [key -> NJ, value -> New Jersey]]|
+----------------------------------------------------------------------------------------------+
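As a side note (not part of the original answer): on Spark 2.4+ the same reshaping can be sketched without a UDF, using the built-in transform and map_entries higher-order SQL functions. A self-contained sketch, assuming a local SparkSession and a simplified one-row version of the question's sourcedf:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Local session purely for illustration.
val spark = SparkSession.builder.master("local[1]").appName("map-to-list").getOrCreate()
import spark.implicits._

val sourcedf = Seq(Map("AL" -> "Alabama", "AK" -> "Alaska")).toDF("my_map")

// map_entries turns the map column into an array of {key, value} structs;
// transform rewraps each struct as map('Key', k, 'Value', v).
val builtin = sourcedf.select(
  expr("transform(map_entries(my_map), e -> map('Key', e.key, 'Value', e.value))").as("my_list"))
builtin.show(false)
```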