PySpark: aggregate data by whether a value exists or not (not count or sum)

Date: 2018-09-07 06:07:29

Tags: sql pyspark

I have a dataset like this, created by the code below:

test = spark.createDataFrame([
    (0, 1, 5, "2018-06-03", "Region A"),
    (1, 1, 2, "2018-06-04", "Region B"),
    (2, 2, 1, "2018-06-03", "Region B"),
    (4, 1, 1, "2018-06-05", "Region C"),
    (5, 3, 2, "2018-06-03", "Region D"),
    (6, 1, 2, "2018-06-03", "Region A"),
    (7, 4, 4, "2018-06-03", "Region A"),
    (8, 4, 4, "2018-06-03", "Region B"),
    (9, 5, 4, "2018-06-03", "Region A"),
    (10, 5, 4, "2018-06-03", "Region B"),
])\
  .toDF("orderid", "customerid", "price", "transactiondate", "location")
test.show()

+-------+----------+-----+---------------+--------+
|orderid|customerid|price|transactiondate|location|
+-------+----------+-----+---------------+--------+
|      0|         1|    5|     2018-06-03|Region A|
|      1|         1|    2|     2018-06-04|Region B|
|      2|         2|    1|     2018-06-03|Region B|
|      4|         1|    1|     2018-06-05|Region C|
|      5|         3|    2|     2018-06-03|Region D|
|      6|         1|    2|     2018-06-03|Region A|
|      7|         4|    4|     2018-06-03|Region A|
|      8|         4|    4|     2018-06-03|Region B|
|      9|         5|    4|     2018-06-03|Region A|
|     10|         5|    4|     2018-06-03|Region B|
+-------+----------+-----+---------------+--------+

I can count each customer's orders per region like this:

from pyspark.sql.functions import count

temp_result = test.groupBy("customerid").pivot("location").agg(count("orderid")).na.fill(0)
temp_result.show()
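
For reference, this count pivot gives a table like the following (tallied by hand from the sample data above, so treat the row order as indicative only):

+----------+--------+--------+--------+--------+
|customerid|Region A|Region B|Region C|Region D|
+----------+--------+--------+--------+--------+
|         5|       1|       1|       0|       0|
|         1|       2|       1|       1|       0|
|         3|       0|       0|       0|       1|
|         2|       0|       1|       0|       0|
|         4|       1|       1|       0|       0|
+----------+--------+--------+--------+--------+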

Now, instead of a count or sum, I want to aggregate the data simply by whether a value exists or not (i.e. 0 or 1), like this:

+----------+--------+--------+--------+--------+
|customerid|Region A|Region B|Region C|Region D|
+----------+--------+--------+--------+--------+
|         5|       1|       1|       0|       0|
|         1|       1|       1|       1|       0|
|         3|       0|       0|       0|       1|
|         2|       0|       1|       0|       0|
|         4|       1|       1|       0|       0|
+----------+--------+--------+--------+--------+


I can get the result above by using count, but is there a simpler way to get it?

1 Answer:

Answer 0 (score: 1)

You are basically there; only a small tweak is needed to get the result you want. Within your aggregation, compare the count against zero and cast the resulting boolean to an integer (if you want 0/1 rather than true/false):

from pyspark.sql.functions import count

temp_result = test.groupBy("customerid")\
                  .pivot("location")\
                  .agg((count("orderid") > 0).cast("integer"))\
                  .na.fill(0)

temp_result.show()

The result is:

+----------+--------+--------+--------+--------+
|customerid|Region A|Region B|Region C|Region D|
+----------+--------+--------+--------+--------+
|         5|       1|       1|       0|       0|
|         1|       1|       1|       1|       0|
|         3|       0|       0|       0|       1|
|         2|       0|       1|       0|       0|
|         4|       1|       1|       0|       0|
+----------+--------+--------+--------+--------+
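
As a side note (not part of the original answer, just a sketch assuming the same test DataFrame), an equivalent formulation that avoids the boolean cast is to tag every row with a literal 1 and pivot with max:

from pyspark.sql.functions import lit, max as max_

# Each order contributes a 1; the max over a pivot cell is therefore 1
# when any order exists for that (customer, region) pair, and the cell
# is null (filled with 0 below) when none does.
temp_result = test.withColumn("flag", lit(1))\
                  .groupBy("customerid")\
                  .pivot("location")\
                  .agg(max_("flag"))\
                  .na.fill(0)

temp_result.show()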

If Spark throws an error on the (count("orderid") > 0).cast("integer") aggregation above, you can use this solution instead, which does the count comparison in a separate step:

temp_result = test.groupBy("customerId", "location")\
                  .agg(count("orderid").alias("count"))\
                  .withColumn("count", (col("count")>0).cast("integer"))\
                  .groupby("customerId")\
                  .pivot("location")\
                  .agg(sum("count")).na.fill(0)

temp_result.show()
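
Since the question is also tagged sql, here is a rough SQL equivalent for comparison. This is only a sketch: a plain SQL version has to spell out the region columns by hand, and the view name orders is hypothetical:

# Register the DataFrame as a temp view so it can be queried with SQL.
test.createOrReplaceTempView("orders")

# MAX(CASE WHEN ...) per region plays the role of the pivot plus the
# existence check: 1 if the customer has any order there, else 0.
spark.sql("""
    SELECT customerid,
           MAX(CASE WHEN location = 'Region A' THEN 1 ELSE 0 END) AS `Region A`,
           MAX(CASE WHEN location = 'Region B' THEN 1 ELSE 0 END) AS `Region B`,
           MAX(CASE WHEN location = 'Region C' THEN 1 ELSE 0 END) AS `Region C`,
           MAX(CASE WHEN location = 'Region D' THEN 1 ELSE 0 END) AS `Region D`
    FROM orders
    GROUP BY customerid
""").show()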