I have a data frame in pyspark, as shown below. I want to group by the category column and count the rows.
df.show()
+--------+----+
|category| val|
+--------+----+
| cat1| 13|
| cat2| 12|
| cat2| 14|
| cat3| 23|
| cat1| 20|
| cat1| 10|
| cat2| 30|
| cat3| 11|
| cat1| 7|
| cat1| 8|
+--------+----+
res = df.groupBy('category').count()
res.show()
+--------+-----+
|category|count|
+--------+-----+
| cat2| 3|
| cat3| 2|
| cat1| 5|
+--------+-----+
That gives me the result I want. Now I want to compute the average per category. The data frame holds 3 days of records, and I want the average over those 3 days.
The result I am after is shown below; essentially I want count / no. of days.
+--------+-----+
|category|count|
+--------+-----+
| cat2| 1|
| cat3| 1|
| cat1| 2|
+--------+-----+
How can I do that?
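For reference, here is a minimal sketch of one way to get count / no. of days in PySpark. It assumes the number of days (3) is taken as given and that the desired output is rounded to whole numbers; if the data frame had a date column (say 'day'), num_days could instead be derived from it:

from pyspark.sql import functions as F

# Number of days spanned by the data; hard-coded to 3 here.
# With a hypothetical date column named 'day', it could be derived as:
#   num_days = df.select('day').distinct().count()
num_days = 3

res = (df.groupBy('category')
         .count()
         .withColumn('count', F.round(F.col('count') / num_days).cast('int')))
res.show()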
Answer 0 (score: 3)
I believe what you want is:
select m.date,
       -- count of 'Create Account' events per date
       sum(case when e.event = 'Create Account' then 1 else 0 end) as [create],
       -- count of 'Update Account' events per date
       sum(case when e.event = 'Update Account' then 1 else 0 end) as updates,
       -- total orders recorded per date
       sum(o.ordersrec) as orders
from @main as m
inner join @orders as o on o.date = m.date
inner join @events as e on e.date = m.date
group by m.date
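For completeness, a rough PySpark equivalent of the conditional aggregation above, assuming hypothetical data frames main_df, orders_df and events_df with date, event and ordersrec columns mirroring the SQL table variables:

from pyspark.sql import functions as F

# Join the three hypothetical data frames on their shared 'date' column.
joined = (main_df
          .join(orders_df, on='date', how='inner')
          .join(events_df, on='date', how='inner'))

# Conditional aggregation: count specific event types and sum order counts per date.
result = (joined.groupBy('date')
          .agg(F.sum(F.when(F.col('event') == 'Create Account', 1).otherwise(0)).alias('create'),
               F.sum(F.when(F.col('event') == 'Update Account', 1).otherwise(0)).alias('updates'),
               F.sum('ordersrec').alias('orders')))
result.show()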