What I did was try a groupby with collect_list:
Data:
id  dates       quantity
--  ----------  --------
12  2012-03-02  1
32  2012-02-21  4
43  2012-03-02  4
5   2012-12-02  5
42  2012-12-02  7
21  2012-31-02  9
3   2012-01-02  5
2   2012-01-02  5
3   2012-01-02  7
2   2012-01-02  1
3   2012-01-02  3
21  2012-01-02  6
21  2012-03-23  5
21  2012-03-24  3
21  2012-04-25  1
21  2012-07-23  6
21  2012-01-02  8
Code:
new_df = df.groupby('id').agg(F.collect_list("dayid"),F.collect_list("quantity"))
Answer 0 (score: 0):
The code looks fine to me. The only problem is that you are using dayid as the column in collect_list, while your data has a dates column; everything else looks good.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create (or reuse) a SparkSession and grab its SparkContext.
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

dataset1 = [{'id': 12, 'dates': '2012-03-02', 'quantity': 1},
            {'id': 32, 'dates': '2012-02-21', 'quantity': 4},
            {'id': 12, 'dates': '2012-03-02', 'quantity': 1},
            {'id': 32, 'dates': '2012-02-21', 'quantity': 4}]

rdd1 = sc.parallelize(dataset1)
df1 = spark.createDataFrame(rdd1)
df1.show()
+----------+---+--------+
| dates| id|quantity|
+----------+---+--------+
|2012-03-02| 12| 1|
|2012-02-21| 32| 4|
|2012-03-02| 12| 1|
|2012-02-21| 32| 4|
+----------+---+--------+
new_df = df1.groupby('id').agg(F.collect_list("dates"), F.collect_list("quantity"))
new_df.show()
+---+--------------------+----------------------+
| id| collect_list(dates)|collect_list(quantity)|
+---+--------------------+----------------------+
| 32|[2012-02-21, 2012...|                [4, 4]|
| 12|[2012-03-02, 2012...|                [1, 1]|
+---+--------------------+----------------------+
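If the auto-generated column names like collect_list(dates) are awkward to work with downstream, the aggregates can be aliased at aggregation time. A minimal sketch, assuming the same df1 as above (the names dates_list and quantity_list are illustrative, not from the original post):

from pyspark.sql import functions as F

# Same aggregation, but with explicit names for the collected columns.
new_df = (
    df1.groupby("id")
       .agg(
           F.collect_list("dates").alias("dates_list"),
           F.collect_list("quantity").alias("quantity_list"),
       )
)
# truncate=False prints the full list contents instead of cutting at 20 chars.
new_df.show(truncate=False)

The aliased columns can then be referenced directly, e.g. new_df.select("id", "quantity_list").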