I have a DataFrame like this:
+---------+-------------+--------------------+--------+
|       ID|      reg_num|             reg_typ|reg_code|
+---------+-------------+--------------------+--------+
|523528690| 134886307000|   Chamber of Commer|   14246|
|523528690|2015 / 369956|   Government Gazett|   14225|
|523528690|    997253630|    Tax Registration|   14259|
|523528691|    997253633|             Tax Doc|   14250|
|523528691|    997253634|            Tax File|   14251|
|523528691|    997253635|            Tax Data|   14252|
|523528691|    997253636|         Tax Monitor|   14253|
+---------+-------------+--------------------+--------+
Now I am trying to produce output in the following format:
+---------+-------------+--------------------+--------+-------------+-------------+-------------+-------------+
|       ID|      reg_num|             reg_typ|reg_code|        reg_1|        reg_2|        reg_3|        reg_4|
+---------+-------------+--------------------+--------+-------------+-------------+-------------+-------------+
|523528690| 134886307000|   Chamber of Commer|   14246| 134886307000|2015 / 369956|    997253630|         null|
|523528690|2015 / 369956|   Government Gazett|   14225| 134886307000|2015 / 369956|    997253630|         null|
|523528690|    997253630|    Tax Registration|   14259| 134886307000|2015 / 369956|    997253630|         null|
|523528691|    997253633|             Tax Doc|   14250|    997253633|    997253634|    997253635|    997253636|
|523528691|    997253634|            Tax File|   14251|    997253633|    997253634|    997253635|    997253636|
|523528691|    997253635|            Tax Data|   14252|    997253633|    997253634|    997253635|    997253636|
|523528691|    997253636|         Tax Monitor|   14253|    997253633|    997253634|    997253635|    997253636|
+---------+-------------+--------------------+--------+-------------+-------------+-------------+-------------+
I have looked at built-in functions like pivot, but it doesn't seem to fit my case.
I am using Spark 1.6 and Scala 2.10.5.
Any help is appreciated!!
Answer (score: 2):
Pivot is the way to go, but the logic behind it is not obvious: number the rows within each id using row_number, pivot on that row number to spread reg_num into columns, and join the result back to the original rows:
import org.apache.spark.sql.expressions.Window
// row_number and first live in functions; in spark-shell, the implicits
// needed for toDF are already imported.
import org.apache.spark.sql.functions.{first, row_number}

val df = Seq(
  (523528690, "134886307000", "Chamber of Commer", 14246),
  (523528690, "2015 / 369956", "Government Gazett", 14225),
  (523528690, "997253630", "Tax Registration", 14259),
  (523528691, "997253633", "Tax Doc", 14250),
  (523528691, "997253634", "Tax File", 14251),
  (523528691, "997253635", "Tax Data", 14252),
  (523528691, "997253636", "Tax Monitor", 14253)
).toDF("id", "reg_num", "reg_type", "reg_code")
// Number the registrations within each id, ordered by reg_num.
val w = Window.partitionBy("id").orderBy("reg_num")
df.show
// +---------+-------------+-----------------+--------+
// | id| reg_num| reg_type|reg_code|
// +---------+-------------+-----------------+--------+
// |523528690| 134886307000|Chamber of Commer| 14246|
// |523528690|2015 / 369956|Government Gazett| 14225|
// |523528690| 997253630| Tax Registration| 14259|
// |523528691| 997253633| Tax Doc| 14250|
// |523528691| 997253634| Tax File| 14251|
// |523528691| 997253635| Tax Data| 14252|
// |523528691| 997253636| Tax Monitor| 14253|
// +---------+-------------+-----------------+--------+
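To see where the new columns come from, here is the right-hand side of the join run on its own (output sketched from the data above; row order may differ):
// row_number assigns each reg_num a position within its id group, and
// pivoting on that position spreads the values into one column per position.
df.withColumn("rn", row_number.over(w))
  .groupBy("id")
  .pivot("rn")
  .agg(first("reg_num"))
  .show
// +---------+------------+-------------+---------+---------+
// |       id|           1|            2|        3|        4|
// +---------+------------+-------------+---------+---------+
// |523528690|134886307000|2015 / 369956|997253630|     null|
// |523528691|   997253633|    997253634|997253635|997253636|
// +---------+------------+-------------+---------+---------+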
// Join the pivoted lookup back to the original rows on id, so every row
// of an id carries the full list of that id's reg_nums.
val df2 = df.join(
  df.withColumn("rn", row_number.over(w))
    .groupBy("id")
    .pivot("rn")
    .agg(first("reg_num")),
  Seq("id"))
df2.show
// +---------+-------------+-----------------+--------+------------+-------------+---------+---------+
// | id| reg_num| reg_type|reg_code| 1| 2| 3| 4|
// +---------+-------------+-----------------+--------+------------+-------------+---------+---------+
// |523528690| 134886307000|Chamber of Commer| 14246|134886307000|2015 / 369956|997253630| null|
// |523528690|2015 / 369956|Government Gazett| 14225|134886307000|2015 / 369956|997253630| null|
// |523528690| 997253630| Tax Registration| 14259|134886307000|2015 / 369956|997253630| null|
// |523528691| 997253633| Tax Doc| 14250| 997253633| 997253634|997253635|997253636|
// |523528691| 997253634| Tax File| 14251| 997253633| 997253634|997253635|997253636|
// |523528691| 997253635| Tax Data| 14252| 997253633| 997253634|997253635|997253636|
// |523528691| 997253636| Tax Monitor| 14253| 997253633| 997253634|997253635|997253636|
// +---------+-------------+-----------------+--------+------------+-------------+---------+---------+
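Note that the pivoted columns come out named 1 through 4 rather than reg_1 through reg_4 as in your desired output. A minimal way to rename them afterwards (assuming at most four registrations per id, as in this data):
// Rename the pivoted columns "1".."4" to "reg_1".."reg_4".
val df3 = (1 to 4).foldLeft(df2) { (acc, i) =>
  acc.withColumnRenamed(i.toString, s"reg_$i")
}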