In a Spark DataFrame, how do I map over a column of list type?

Time: 2020-05-28 06:23:34

Tags: apache-spark-sql

val df1 = Seq(
("a",2,"c"),
("a",2,"c"),
("a",2,"c"),
("b",2,"d"),
("b",2,"d")
).toDF("col1","col2","col3").groupBy("col2").agg(
      collect_list("col1").as("col1"),
      collect_list("col3").as("col3")
    )
df1.show

Output:

+----+---------------+---------------+
|col2|           col1|           col3|
+----+---------------+---------------+
|   2|[a, a, b, b, a]|[c, c, d, d, c]|
+----+---------------+---------------+

How can I get the following table? (Each element in a list should be prefixed with the name of its column.)

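+----+----------------------------------------+----------------------------------------+
|col2|col1                                    |col3                                    |
+----+----------------------------------------+----------------------------------------+
|2   |[col1-a, col1-a, col1-a, col1-b, col1-b]|[col3-c, col3-c, col3-c, col3-d, col3-d]|
+----+----------------------------------------+----------------------------------------+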

1 answer:

Answer 0 (score: 0)

Try the following approach to solve this:

1. Load the input data

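// Assumes a SparkSession named `spark` is in scope (as in spark-shell)
import org.apache.spark.sql.functions.collect_list
import spark.implicits._

// Group by col2 and collect col1 and col3 into array columns
val df1 = Seq(
  ("a", 2, "c"),
  ("a", 2, "c"),
  ("a", 2, "c"),
  ("b", 2, "d"),
  ("b", 2, "d")
).toDF("col1", "col2", "col3").groupBy("col2").agg(
  collect_list("col1").as("col1"),
  collect_list("col3").as("col3")
)
df1.show(false)
df1.printSchema()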

Output:

+----+---------------+---------------+
|col2|col1           |col3           |
+----+---------------+---------------+
|2   |[a, a, a, b, b]|[c, c, c, d, d]|
+----+---------------+---------------+

root
 |-- col2: integer (nullable = false)
 |-- col1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col3: array (nullable = true)
 |    |-- element: string (containsNull = true)

2. Use the TRANSFORM function to process the array values

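import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.types.ArrayType

// Build a SQL TRANSFORM expression (Spark 2.4+) that prefixes each element with "<column name>-"
val transform = (str: String) => expr(s"TRANSFORM($str, x -> concat('$str-', x)) as $str")
// Apply it to array columns only; all other columns pass through unchanged
val cols = df1.schema.map(f => if (f.dataType.isInstanceOf[ArrayType]) {
  transform(f.name)
} else expr(f.name))

df1.select(cols: _*).show(false)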

Output:

+----+----------------------------------------+----------------------------------------+
|col2|col1                                    |col3                                    |
+----+----------------------------------------+----------------------------------------+
|2   |[col1-a, col1-a, col1-a, col1-b, col1-b]|[col3-c, col3-c, col3-c, col3-d, col3-d]|
+----+----------------------------------------+----------------------------------------+
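
As a possible alternative to building the SQL string by hand, the same per-element mapping can be sketched with the typed `transform` function from `org.apache.spark.sql.functions`. This is only a sketch and rests on a few assumptions that are not part of the original answer: it needs Spark 3.0 or later (where the Scala `transform` function was added), it starts its own local `SparkSession`, and the names `PrefixArrayElements` and `prefixArrays` are made up for illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, concat, lit, transform}
import org.apache.spark.sql.types.ArrayType

// Hypothetical helper object; the names here are illustrative only.
object PrefixArrayElements {

  // Prefix every element of each array column with "<column name>-".
  // Non-array columns are selected unchanged.
  // Assumes Spark 3.0+ and simple column names (no dots or backticks).
  def prefixArrays(df: DataFrame): DataFrame = {
    val cols: Seq[Column] = df.schema.fields.toSeq.map { f =>
      f.dataType match {
        case _: ArrayType =>
          transform(col(f.name), x => concat(lit(s"${f.name}-"), x)).as(f.name)
        case _ =>
          col(f.name)
      }
    }
    df.select(cols: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for demonstration purposes only
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("prefix-array-elements")
      .getOrCreate()
    import spark.implicits._

    // Same input as in step 1
    val df1 = Seq(
      ("a", 2, "c"),
      ("a", 2, "c"),
      ("a", 2, "c"),
      ("b", 2, "d"),
      ("b", 2, "d")
    ).toDF("col1", "col2", "col3").groupBy("col2").agg(
      collect_list("col1").as("col1"),
      collect_list("col3").as("col3")
    )

    prefixArrays(df1).show(false)
    spark.stop()
  }
}
```

Because the prefix is passed as a literal with `lit` instead of being interpolated into a SQL string, this variant does not depend on quoting the column name inside the expression. Note that `collect_list` does not guarantee element order after a shuffle, so the order of elements within each array may differ between runs.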