Transpose in Spark

Time: 2017-05-30 10:53:31

Tags: scala apache-spark apache-spark-sql

I have a DataFrame like this:

+---------+-------------+--------------------+--------+
|     ID  |      reg_num|             reg_typ|reg_code|
+---------+-------------+--------------------+--------+
|523528690| 134886307000|Chamber of Commer   |   14246|
|523528690|2015 / 369956|Government Gazett   |   14225|
|523528690|    997253630|Tax Registration    |   14259|
|523528691|    997253633|Tax Doc             |   14250|
|523528691|    997253634|Tax File            |   14251|
|523528691|    997253635|Tax Data            |   14252|
|523528691|    997253636|Tax Monitor         |   14253|
+---------+-------------+--------------------+--------+

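Now I am trying to produce output in the following format, where every row also carries all the reg_num values of its ID as extra columns: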

+---------+-------------+--------------------+--------+-------------+-------------+-------------+-------------+
|     ID  |      reg_num|             reg_typ|reg_code|      reg_1  |      reg_2  |      reg_3  |      reg_4  |
+---------+-------------+--------------------+--------+-------------+-------------+-------------+-------------+
|523528690| 134886307000|Chamber of Commer   |   14246| 134886307000|2015 / 369956|    997253630|         null|
|523528690|2015 / 369956|Government Gazett   |   14225| 134886307000|2015 / 369956|    997253630|         null|
|523528690|    997253630|Tax Registration    |   14259| 134886307000|2015 / 369956|    997253630|         null|
|523528691|    997253633|Tax Doc             |   14250|    997253633|    997253634|    997253635|    997253636|
|523528691|    997253634|Tax File            |   14251|    997253633|    997253634|    997253635|    997253636|
|523528691|    997253635|Tax Data            |   14252|    997253633|    997253634|    997253635|    997253636|
|523528691|    997253636|Tax Monitor         |   14253|    997253633|    997253634|    997253635|    997253636|
+---------+-------------+--------------------+--------+-------------+-------------+-------------+-------------+

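I have seen predefined functions such as pivot, but it does not seem to fit my case.

I am using Spark version 1.6 and Scala version 2.10.5.

Any help is appreciated!

1 Answer:

Answer 0 (score: 2):

Pivot is the way to go, but the logic behind it is not obvious: number the rows within each id with row_number, pivot on that row number so each reg_num gets its own column, then join the result back to the original DataFrame: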

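import org.apache.spark.sql.expressions.Window
// first and row_number live here (spark-shell imports them automatically)
import org.apache.spark.sql.functions.{first, row_number}
// Note: on Spark 1.6, window functions such as row_number require a HiveContext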

val df = Seq(
  (523528690, "134886307000", "Chamber of Commer", 14246),
  (523528690, "2015 / 369956", "Government Gazett", 14225),
  (523528690, "997253630", "Tax Registration", 14259),
  (523528691, "997253633", "Tax Doc", 14250),
  (523528691, "997253634", "Tax File", 14251),
  (523528691, "997253635", "Tax Data", 14252),
  (523528691, "997253636", "Tax Monitor", 14253)).toDF("id", "reg_num", "reg_type", "reg_code")

val w = Window.partitionBy("id").orderBy("reg_num")
df.show
// +---------+-------------+-----------------+--------+
// |       id|      reg_num|         reg_type|reg_code|
// +---------+-------------+-----------------+--------+
// |523528690| 134886307000|Chamber of Commer|   14246|
// |523528690|2015 / 369956|Government Gazett|   14225|
// |523528690|    997253630| Tax Registration|   14259|
// |523528691|    997253633|          Tax Doc|   14250|
// |523528691|    997253634|         Tax File|   14251|
// |523528691|    997253635|         Tax Data|   14252|
// |523528691|    997253636|      Tax Monitor|   14253|
// +---------+-------------+-----------------+--------+


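// Number the rows within each id, then pivot so each row number becomes a column
val pivoted = df.withColumn("rn", row_number().over(w))
  .groupBy("id")
  .pivot("rn")
  .agg(first("reg_num"))

// Join the wide, one-row-per-id result back onto every original row
val df2 = df.join(pivoted, Seq("id"))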
df2.show
// +---------+-------------+-----------------+--------+------------+-------------+---------+---------+
// |       id|      reg_num|         reg_type|reg_code|           1|            2|        3|        4|
// +---------+-------------+-----------------+--------+------------+-------------+---------+---------+
// |523528690| 134886307000|Chamber of Commer|   14246|134886307000|2015 / 369956|997253630|     null|
// |523528690|2015 / 369956|Government Gazett|   14225|134886307000|2015 / 369956|997253630|     null|
// |523528690|    997253630| Tax Registration|   14259|134886307000|2015 / 369956|997253630|     null|
// |523528691|    997253633|          Tax Doc|   14250|   997253633|    997253634|997253635|997253636|
// |523528691|    997253634|         Tax File|   14251|   997253633|    997253634|997253635|997253636|
// |523528691|    997253635|         Tax Data|   14252|   997253633|    997253634|997253635|997253636|
// |523528691|    997253636|      Tax Monitor|   14253|   997253633|    997253634|997253635|997253636|
// +---------+-------------+-----------------+--------+------------+-------------+---------+---------+
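Since the answer notes that the logic behind the pivot is not obvious, here is a minimal sketch of the same three steps (number, pivot, join) on plain Scala collections, with no Spark involved. It is an illustration only, not how Spark executes the query. The names `PivotSketch`, `Row`, `width`, `pivoted`, `widened` and `regColumns` are made up for this example; the data is copied from the question.

```scala
object PivotSketch {
  // One input row: (id, reg_num, reg_type, reg_code), values copied from the question.
  type Row = (Int, String, String, Int)

  val rows: Seq[Row] = Seq(
    (523528690, "134886307000", "Chamber of Commer", 14246),
    (523528690, "2015 / 369956", "Government Gazett", 14225),
    (523528690, "997253630", "Tax Registration", 14259),
    (523528691, "997253633", "Tax Doc", 14250),
    (523528691, "997253634", "Tax File", 14251),
    (523528691, "997253635", "Tax Data", 14252),
    (523528691, "997253636", "Tax Monitor", 14253))

  // The largest group decides how many reg_N columns are needed.
  val width: Int = rows.groupBy(_._1).values.map(_.size).max

  // Steps 1 and 2: per id, sort the reg_num values (this plays the role of
  // row_number over the window) and pad with None so that every id has the
  // same number of slots (this plays the role of the pivot).
  val pivoted: Map[Int, Seq[Option[String]]] =
    rows.groupBy(_._1).map { case (id, group) =>
      val sorted = group.map(_._2).sorted.map(Option(_))
      id -> sorted.padTo(width, None)
    }

  // Step 3: attach the wide per-id values to every original row (the join on id).
  val widened: Seq[(Row, Seq[Option[String]])] =
    rows.map(r => (r, pivoted(r._1)))

  // Hypothetical header names matching the reg_1 ... reg_N layout from the question.
  val regColumns: Seq[String] = (1 to width).map(i => s"reg_$i")
}
```

Each entry of `PivotSketch.widened` pairs an original row with the padded list of all reg_num values for its id, which mirrors the rows of `df2` above. In the Spark version the pivoted columns come out named 1 to 4; to get the reg_1 to reg_4 headers from the question, they could be renamed afterwards, for example with `withColumnRenamed`.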