Count distinct column values based on a condition in PySpark

Date: 2020-12-23 12:50:06

Tags: python dataframe apache-spark pyspark apache-spark-sql

I have a column with two possible values: 'users' or 'not_users'.

What I want is a distinct count of customers for which that value is 'users'.

This is the code I'm using:

from pyspark.sql import functions as f

output = (df
          .withColumn('week', f.expr('DATE_FORMAT(DATE_SUB(registration_date, 1), "Y-ww")'))
          .groupby('week')
          .agg(f.countDistinct('customer_id').alias('count_total_users'),
               f.countDistinct('vegetables_customers').alias('count_vegetable_users')))

display(output)

This is the output (not what I want):

Week        count_total_users      count_vegetable_users
2020-40            2345                        2
2020-41            5678                        2
2020-42            3345                        2
2020-43            5689                        2

Desired output:

Week        count_total_users      count_vegetable_users
2020-40            2345                        457
2020-41            5678                        1987
2020-42            3345                        2308
2020-43            5689                        4000

The desired output should be the distinct count of customers whose value in that column is 'users'.

Any clues?

1 answer:

Answer 0 (score: 0)

Is df2 the result you want?

df.show()
+----+-----------+--------------------+
|week|customer_id|vegetables_customers|
+----+-----------+--------------------+
|   1|          1|               users|
|   1|          2|           not_users|
|   1|          3|               users|
|   2|          1|           not_users|
|   2|          2|           not_users|
|   2|          3|               users|
+----+-----------+--------------------+

import pyspark.sql.functions as F

df2 = df.groupBy('week').agg(
    F.countDistinct('customer_id').alias('count_total_users'),
    # count customer_id only where the label is 'users'; when() without otherwise()
    # returns NULL for the other rows, and countDistinct ignores NULLs
    F.countDistinct(
        F.when(
            F.col('vegetables_customers') == 'users',
            F.col('customer_id')
        )
    ).alias('count_vegetable_users')
)

df2.show()
+----+-----------------+---------------------+
|week|count_total_users|count_vegetable_users|
+----+-----------------+---------------------+
|   1|                3|                    2|
|   2|                3|                    1|
+----+-----------------+---------------------+
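Applied to the aggregation in the question, a minimal sketch (assuming a DataFrame df with the registration_date, customer_id and vegetables_customers columns described there):

from pyspark.sql import functions as f

output = (df
          # week label derived exactly as in the question
          .withColumn('week', f.expr('DATE_FORMAT(DATE_SUB(registration_date, 1), "Y-ww")'))
          .groupby('week')
          .agg(f.countDistinct('customer_id').alias('count_total_users'),
               # count customer_id only for rows labelled 'users'
               f.countDistinct(
                   f.when(f.col('vegetables_customers') == 'users', f.col('customer_id'))
               ).alias('count_vegetable_users')))

The key point is that when() without an otherwise() clause returns NULL for non-matching rows, and countDistinct skips NULLs, so only customers labelled 'users' are counted.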