I have a dataset like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, sum  # note: Spark's sum shadows the builtin

spark = SparkSession.builder.getOrCreate()

test = spark.createDataFrame([
    (0, 1, 5, "2018-06-03", "Region A"),
    (1, 1, 2, "2018-06-04", "Region B"),
    (2, 2, 1, "2018-06-03", "Region B"),
    (4, 1, 1, "2018-06-05", "Region C"),
    (5, 3, 2, "2018-06-03", "Region D"),
    (6, 1, 2, "2018-06-03", "Region A"),
    (7, 4, 4, "2018-06-03", "Region A"),
    (8, 4, 4, "2018-06-03", "Region B"),
    (9, 5, 4, "2018-06-03", "Region A"),
    (10, 5, 4, "2018-06-03", "Region B"),
]).toDF("orderid", "customerid", "price", "transactiondate", "location")
test.show()
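For reference, test.show() on this data prints a table like the following (row order may vary):

+-------+----------+-----+---------------+--------+
|orderid|customerid|price|transactiondate|location|
+-------+----------+-----+---------------+--------+
|      0|         1|    5|     2018-06-03|Region A|
|      1|         1|    2|     2018-06-04|Region B|
|      2|         2|    1|     2018-06-03|Region B|
|      4|         1|    1|     2018-06-05|Region C|
|      5|         3|    2|     2018-06-03|Region D|
|      6|         1|    2|     2018-06-03|Region A|
|      7|         4|    4|     2018-06-03|Region A|
|      8|         4|    4|     2018-06-03|Region B|
|      9|         5|    4|     2018-06-03|Region A|
|     10|         5|    4|     2018-06-03|Region B|
+-------+----------+-----+---------------+--------+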
I can summarize each customer's orders per region like this:

temp_result = test.groupBy("customerid").pivot("location").agg(count("orderid")).na.fill(0)
temp_result.show()

(or with sum instead of count). Now, rather than counting or summing, I'd like to simply summarize the data by whether a value exists at all (i.e., 0 or 1). I can get that result from the count, but is there a simpler way to get it?
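For concreteness, the count-based route hinted at above could look like this sketch, which clamps the pivoted counts to 0/1 in a second pass (the names counts, region_cols, and presence are illustrative, not from the question):

from pyspark.sql.functions import col, count, when

counts = test.groupBy("customerid").pivot("location").agg(count("orderid")).na.fill(0)
# Clamp every pivoted count column down to a 0/1 presence flag
region_cols = [c for c in counts.columns if c != "customerid"]
presence = counts.select(
    "customerid",
    *[when(col(c) > 0, 1).otherwise(0).alias(c) for c in region_cols]
)
presence.show()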
Answer 0 (score: 1):
You're basically there; only a small adjustment is needed to get the result you want. In your aggregation, add a comparison on the count and cast the resulting boolean to an integer (if needed):
temp_result = test.groupBy("customerid")\
.pivot("location")\
.agg((count("orderid")>0).cast("integer"))\
.na.fill(0)
temp_result.show()
The result is:
+----------+--------+--------+--------+--------+
|customerid|Region A|Region B|Region C|Region D|
+----------+--------+--------+--------+--------+
| 5| 1| 1| 0| 0|
| 1| 1| 1| 1| 0|
| 3| 0| 0| 0| 1|
| 2| 0| 1| 0| 0|
| 4| 1| 1| 0| 0|
+----------+--------+--------+--------+--------+
In case you get a Spark error with the expression above, you can use this solution instead, which performs the count comparison in an additional step:
temp_result = test.groupBy("customerId", "location")\
.agg(count("orderid").alias("count"))\
.withColumn("count", (col("count")>0).cast("integer"))\
.groupby("customerId")\
.pivot("location")\
.agg(sum("count")).na.fill(0)
temp_result.show()
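As a further alternative (my own sketch, not part of the original answer), you can skip the comparison entirely by pivoting on a constant flag and taking its max; the column name flag is made up for illustration:

from pyspark.sql.functions import lit, max as max_

# Tag every order with a constant 1; max(flag) is then 1 wherever at least
# one order exists for that customer/region, and null (filled to 0) otherwise
presence = test.withColumn("flag", lit(1))\
    .groupBy("customerid")\
    .pivot("location")\
    .agg(max_("flag"))\
    .na.fill(0)
presence.show()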