I have a PySpark DataFrame where I want to group by a given index and then merge all the values in each column into a single list per column.
Example input:
id_1| id_2| id_3|timestamp|thing1|thing2|thing3
A | b | c |time_0 |1.2 |1.3 |2.5
A | b | c |time_1 |1.1 |1.5 |3.4
A | b | c |time_2 |2.2 |2.6 |2.9
A | b | d |time_0 |5.1 |5.5 |5.7
A | b | d |time_1 |6.1 |6.2 |6.3
A | b | e |time_0 |0.1 |0.5 |0.9
A | b | e |time_1 |0.2 |0.3 |0.6
Example output:
id_1|id_2|id_3| timestamp |thing1 |thing2 |thing3
A |b | c |[time_0,time_1,time_2]|[1.2,1.1,2.2]|[1.3,1.5,2.6]|[2.5,3.4,2.9]
A |b | d |[time_0,time_1] |[5.1,6.1] |[5.5,6.2] |[5.7,6.3]
A |b | e |[time_0,time_1] |[0.1,0.2] |[0.5,0.3] |[0.9,0.6]
How can I do this efficiently?
Answer 0 (score: 1)
Use collect_list(), as also suggested above.
# Creating the DataFrame
df = sqlContext.createDataFrame(
    [('A', 'b', 'c', 'time_0', 1.2, 1.3, 2.5),
     ('A', 'b', 'c', 'time_1', 1.1, 1.5, 3.4),
     ('A', 'b', 'c', 'time_2', 2.2, 2.6, 2.9),
     ('A', 'b', 'd', 'time_0', 5.1, 5.5, 5.7),
     ('A', 'b', 'd', 'time_1', 6.1, 6.2, 6.3),
     ('A', 'b', 'e', 'time_0', 0.1, 0.5, 0.9),
     ('A', 'b', 'e', 'time_1', 0.2, 0.3, 0.6)],
    ['id_1', 'id_2', 'id_3', 'timestamp', 'thing1', 'thing2', 'thing3'])
df.show()
+----+----+----+---------+------+------+------+
|id_1|id_2|id_3|timestamp|thing1|thing2|thing3|
+----+----+----+---------+------+------+------+
| A| b| c| time_0| 1.2| 1.3| 2.5|
| A| b| c| time_1| 1.1| 1.5| 3.4|
| A| b| c| time_2| 2.2| 2.6| 2.9|
| A| b| d| time_0| 5.1| 5.5| 5.7|
| A| b| d| time_1| 6.1| 6.2| 6.3|
| A| b| e| time_0| 0.1| 0.5| 0.9|
| A| b| e| time_1| 0.2| 0.3| 0.6|
+----+----+----+---------+------+------+------+
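For reference, a minimal sketch of the agg() approach mentioned below, assuming the df created above (the alias() calls are an addition here so the output column names match the SQL version that follows):

from pyspark.sql.functions import collect_list

# Group by the id columns and collect each remaining column into a list
result = (df.groupBy('id_1', 'id_2', 'id_3')
            .agg(collect_list('timestamp').alias('timestamp'),
                 collect_list('thing1').alias('thing1'),
                 collect_list('thing2').alias('thing2'),
                 collect_list('thing3').alias('thing3')))
result.show(truncate=False)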
Besides using agg(), you can also write the familiar SQL syntax to do this, but first we have to register the DataFrame as a temporary SQL view:
df.createOrReplaceTempView("df_view")
df = spark.sql("""select id_1, id_2, id_3,
collect_list(timestamp) as timestamp,
collect_list(thing1) as thing1,
collect_list(thing2) as thing2,
collect_list(thing3) as thing3
from df_view
group by id_1, id_2, id_3""")
df.show(truncate=False)
+----+----+----+------------------------+---------------+---------------+---------------+
|id_1|id_2|id_3|timestamp |thing1 |thing2 |thing3 |
+----+----+----+------------------------+---------------+---------------+---------------+
|A |b |d |[time_0, time_1] |[5.1, 6.1] |[5.5, 6.2] |[5.7, 6.3] |
|A |b |e |[time_0, time_1] |[0.1, 0.2] |[0.5, 0.3] |[0.9, 0.6] |
|A |b |c |[time_0, time_1, time_2]|[1.2, 1.1, 2.2]|[1.3, 1.5, 2.6]|[2.5, 3.4, 2.9]|
+----+----+----+------------------------+---------------+---------------+---------------+
注意: """
已用于具有多行语句,以提高可见性和整洁度。使用简单的'select id_1 ....'
,如果您尝试将语句分散到多行上将无法使用。不用说,最终结果将是相同的。
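If you would rather avoid triple quotes, one alternative (an illustrative sketch, not from the original answer) is Python's implicit concatenation of adjacent string literals inside parentheses:

# Adjacent string literals inside parentheses are joined by Python,
# so the query can still span several lines without triple quotes.
query = ("select id_1, id_2, id_3, "
         "collect_list(timestamp) as timestamp, "
         "collect_list(thing1) as thing1, "
         "collect_list(thing2) as thing2, "
         "collect_list(thing3) as thing3 "
         "from df_view group by id_1, id_2, id_3")
df = spark.sql(query)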
Answer 1 (score: 0)
Below is an example, from the GitHub test TestExample1:
from pyspark.sql.functions import col, collect_list

# spark is an existing SparkSession (the GitHub test accesses it as self.spark)
exampleDf = spark.createDataFrame(
    [('A', 'b', 'c', 'time_0', 1.2, 1.3, 2.5),
     ('A', 'b', 'c', 'time_1', 1.1, 1.5, 3.4)],
    ("id_1", "id_2", "id_3", "timestamp", "thing1", "thing2", "thing3"))
exampleDf.show()

ans = exampleDf.groupBy(col("id_1"), col("id_2"), col("id_3")) \
    .agg(collect_list(col("timestamp")),
         collect_list(col("thing1")),
         collect_list(col("thing2")))
ans.show()
+----+----+----+---------+------+------+------+
|id_1|id_2|id_3|timestamp|thing1|thing2|thing3|
+----+----+----+---------+------+------+------+
| A| b| c| time_0| 1.2| 1.3| 2.5|
| A| b| c| time_1| 1.1| 1.5| 3.4|
+----+----+----+---------+------+------+------+
+----+----+----+-----------------------+--------------------+--------------------+
|id_1|id_2|id_3|collect_list(timestamp)|collect_list(thing1)|collect_list(thing2)|
+----+----+----+-----------------------+--------------------+--------------------+
| A| b| c| [time_0, time_1]| [1.2, 1.1]| [1.3, 1.5]|
+----+----+----+-----------------------+--------------------+--------------------+
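One caveat: collect_list() does not guarantee any particular element order after a shuffle, while the expected output in the question is ordered by timestamp. A possible way to enforce that ordering (a sketch, not from either answer, assuming a Spark version that supports field extraction from arrays of structs) is to collect whole rows as structs and sort the array:

from pyspark.sql.functions import col, collect_list, sort_array, struct

# Structs compare field by field in declaration order, so putting timestamp
# first makes sort_array order each group's rows by timestamp; then each
# field is pulled back out as its own list column.
ordered = (exampleDf.groupBy("id_1", "id_2", "id_3")
           .agg(sort_array(collect_list(
               struct("timestamp", "thing1", "thing2", "thing3"))).alias("s"))
           .select("id_1", "id_2", "id_3",
                   col("s.timestamp").alias("timestamp"),
                   col("s.thing1").alias("thing1"),
                   col("s.thing2").alias("thing2"),
                   col("s.thing3").alias("thing3")))
ordered.show(truncate=False)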