Input DataFrame:
id,page,location,trlmonth
1,mobile,chn,08/2018
2,product,mdu,09/2018
3,product,mdu,09/2018
4,mobile,chn,08/2018
5,book,delhi,10/2018
7,music,ban,11/2018
Expected output DataFrame:
userdetail,count
mobile-chn-08/2018,2
product-mdu-09/2018,2
book-delhi-10/2018,1
music-ban-11/2018,1
I tried collapsing a single column into one value, but how do I combine multiple columns into one?
from pyspark.sql import functions as F
df2 = (df
.groupby("id")
.agg(F.concat_ws("-", F.sort_array(F.collect_list("product"))).alias("products"))
.groupby("products")
.agg(F.count("id").alias("count")))
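Answer 0 (score: 0)
We can group by the columns that make up userdetail, take the count, and then concatenate those columns into a single one. Try this: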
>>> df.orderBy('trlmonth').groupby('page','location','trlmonth').count().select(F.concat_ws('-','page','location','trlmonth').alias('user_detail'),'count').show()
+-------------------+-----+
|        user_detail|count|
+-------------------+-----+
| mobile-chn-08/2018|    2|
|product-mdu-09/2018|    2|
| book-delhi-10/2018|    1|
|  music-ban-11/2018|    1|
+-------------------+-----+