For example, here is my test data:
test = spark.createDataFrame([
    (0, 1, 5, "2018-06-03", "Region A"),
    (1, 1, 2, "2018-06-04", "Region B"),
    (2, 2, 1, "2018-06-03", "Region B"),
    (3, 3, 1, "2018-06-01", "Region A"),
    (3, 1, 3, "2018-06-05", "Region A"),
]).toDF("orderid", "customerid", "price", "transactiondate", "location")
test.show()
and I can get aggregated data like this:
from pyspark.sql.functions import sum  # pyspark's sum, shadowing the built-in

test.groupBy("customerid", "location").agg(sum("price")).show()
But I also want percentage data, like this:
+----------+--------+----------+----------+
|customerid|location|sum(price)|percentage|
+----------+--------+----------+----------+
|         1|Region B|         2|       20%|
|         1|Region A|         8|       80%|
|         3|Region A|         1|      100%|
|         2|Region B|         1|      100%|
+----------+--------+----------+----------+
I would like to know how to do this in Spark. I have only found a pandas example, in How to get percentage of counts of a column after groupby in Pandas.
Update:
With the help of @Gordon Linoff, I can get the percentage with:
from pyspark.sql.functions import col, sum
from pyspark.sql.window import Window

test.groupBy("customerid", "location").agg(sum("price"))\
    .withColumn("percentage",
                col("sum(price)") / sum("sum(price)").over(Window.partitionBy(test["customerid"])))\
    .show()
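As a side note, the same percentages can also be computed without a window function, by aggregating the per-customer totals separately and joining them back. A minimal sketch of that approach, using the test DataFrame above (the column names t_price and total_price are just illustrative):
from pyspark.sql import functions as F

# Per (customer, location) sums and per-customer totals.
per_location = test.groupBy("customerid", "location").agg(F.sum("price").alias("t_price"))
per_customer = test.groupBy("customerid").agg(F.sum("price").alias("total_price"))

# Join the two and divide to get each location's share of the customer's spend.
(per_location.join(per_customer, on="customerid")
             .withColumn("percentage", F.col("t_price") / F.col("total_price"))
             .show())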
Answer 0 (score: 1)
This answers the original version of the question.
In SQL, you can use window functions:
select customerid, location, sum(price),
       sum(price) / sum(sum(price)) over (partition by customerid) as ratio
from t
group by customerid, location;
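For what it's worth, the same SQL can be run from PySpark by registering the DataFrame as a temporary view. A minimal sketch, assuming the test DataFrame and spark session from the question (the view name t is just illustrative):
# Expose the DataFrame to Spark SQL under the name "t".
test.createOrReplaceTempView("t")

spark.sql("""
    select customerid, location, sum(price),
           sum(price) / sum(sum(price)) over (partition by customerid) as ratio
    from t
    group by customerid, location
""").show()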
Answer 1 (score: 1)
Here is clean code for your question:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
(test.groupby("customerid", "location")
     .agg(F.sum("price").alias("t_price"))
     .withColumn("perc", F.col("t_price") / F.sum("t_price").over(Window.partitionBy("customerid")))
     .show())
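If you want the percentage rendered as a string such as 80%, as in the desired output above, one possible extension is sketched below (the perc_str column name is just illustrative):
# Reuse the same aggregation, then turn the fraction into an "NN%" string.
result = (test.groupby("customerid", "location")
              .agg(F.sum("price").alias("t_price"))
              .withColumn("perc", F.col("t_price") / F.sum("t_price").over(Window.partitionBy("customerid")))
              .withColumn("perc_str",
                          F.concat(F.round(F.col("perc") * 100).cast("int").cast("string"), F.lit("%"))))
result.show()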