Group by and divide the count of grouped elements in a PySpark DataFrame

Date: 2018-05-16 22:58:08

Tags: python apache-spark pyspark

I have a DataFrame in PySpark, shown below. I want to group the DataFrame by the category column and count the rows in each group.

df.show()
+--------+----+
|category| val|
+--------+----+
|    cat1|  13|
|    cat2|  12|
|    cat2|  14|
|    cat3|  23|
|    cat1|  20|
|    cat1|  10|
|    cat2|  30|
|    cat3|  11|
|    cat1|   7|
|    cat1|   8|
+--------+----+


res = df.groupBy('category').count()

res.show()

+--------+-----+
|category|count|
+--------+-----+
|    cat2|    3|
|    cat3|    2|
|    cat1|    5|
+--------+-----+

That gives me the result I want. Now I want a per-category average: the DataFrame holds 3 days of records, and I want the average count over those 3 days.

The result I want is below; essentially I want to compute count/no.of.days.

+--------+-----+
|category|count|
+--------+-----+
|    cat2|    1|
|    cat3|    1|
|    cat1|    2|
+--------+-----+

How can I do this?
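
A minimal sketch of one way to do this, assuming the number of days (here 3) is known up front and that rounding to the nearest whole number is what produces the desired output above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("cat1", 13), ("cat2", 12), ("cat2", 14), ("cat3", 23), ("cat1", 20),
     ("cat1", 10), ("cat2", 30), ("cat3", 11), ("cat1", 7), ("cat1", 8)],
    ["category", "val"])

num_days = 3  # assumed known up front; the question says the frame covers 3 days

# Count rows per category, divide by the number of days, and round,
# which reproduces the whole-number output shown above.
res = (df.groupBy("category")
         .count()
         .withColumn("count", F.round(F.col("count") / num_days).cast("int")))
res.show()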

1 Answer:

Answer 0 (score: 3):

I believe what you want is

select m.date,
       sum(case when e.event = 'Create Account' then 1 else 0 end) as [create],
       sum(case when e.event = 'Update Account' then 1 else 0 end) as updates,
       sum(o.ordersrec) as orders
from @main as m
       inner join @orders as o on o.date = m.date
       inner join @events as e on e.date = m.date
group by m.date
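
That query is T-SQL (note the @-prefixed table variables), so it does not run in PySpark as written. If the same join-then-conditional-count pattern were wanted in PySpark, a rough sketch could look like the following; main, orders, and events are hypothetical stand-ins for @main, @orders, and @events:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the @main, @orders, and @events table variables.
main = spark.createDataFrame([("2018-05-01",), ("2018-05-02",)], ["date"])
orders = spark.createDataFrame([("2018-05-01", 5), ("2018-05-02", 7)],
                               ["date", "ordersrec"])
events = spark.createDataFrame([("2018-05-01", "Create Account"),
                                ("2018-05-02", "Update Account")],
                               ["date", "event"])

# Inner-join on date, then count events conditionally per date.
result = (main.join(orders, "date").join(events, "date")
              .groupBy("date")
              .agg(F.sum(F.when(F.col("event") == "Create Account", 1)
                          .otherwise(0)).alias("creates"),
                   F.sum(F.when(F.col("event") == "Update Account", 1)
                          .otherwise(0)).alias("updates"),
                   F.sum("ordersrec").alias("orders")))
result.show()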