Question

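I know this is a very specific question, and questions like this are not usually posted on stackoverflow, but I am in the odd situation of having come up with a naive algorithm for my problem without being able to implement it. Hence my question.

I have a data frame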

|user_id| action | day | week |
------------------------------
| d25as | AB     | 2   | 1    |
| d25as | AB     | 3   | 2    |
| d25as | AB     | 5   | 1    | 
| m3562 | AB     | 1   | 3    |
| m3562 | AB     | 7   | 1    |
| m3562 | AB     | 9   | 1    |
| ha42a | AB     | 3   | 2    |
| ha42a | AB     | 4   | 3    |
| ha42a | AB     | 5   | 1    |

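I want to create a data frame containing the users who were seen on at least 3 days per week, for at least 3 weeks per month. The "day" column ranges from 1 to 31 and the "week" column from 1 to 4.

The way I thought of doing it is: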

split dataframe into 4 dataframes for each week
for every week_dataframe count days seen per user. 
count for every user how many weeks with >= 3 days they were seen.
only add to the new df the users seen for >= 3 such weeks.
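The steps above can be sketched in plain Python before moving to Spark. This is only an illustration of the logic, not a scalable implementation: the function name `naive_frequent_users` and the extra user `u1` are made up for the example, and days are counted as distinct values so that repeated rows for the same day do not inflate the count.

```python
from collections import defaultdict


def naive_frequent_users(rows, min_days=3, min_weeks=3):
    """Return users seen on >= min_days distinct days in >= min_weeks weeks.

    rows: iterable of (user_id, action, day, week) tuples.
    """
    # Split by week and collect the distinct days seen per (user, week).
    days_seen = defaultdict(set)
    for user_id, _action, day, week in rows:
        days_seen[(user_id, week)].add(day)

    # For every user, count the weeks that reach min_days.
    good_weeks = defaultdict(int)
    for (user_id, _week), days in days_seen.items():
        if len(days) >= min_days:
            good_weeks[user_id] += 1

    # Keep only the users with enough such weeks.
    return sorted(u for u, n in good_weeks.items() if n >= min_weeks)


# The sample rows from the question: nobody reaches 3 days in any week.
sample = [
    ("d25as", "AB", 2, 1), ("d25as", "AB", 3, 2), ("d25as", "AB", 5, 1),
    ("m3562", "AB", 1, 3), ("m3562", "AB", 7, 1), ("m3562", "AB", 9, 1),
    ("ha42a", "AB", 3, 2), ("ha42a", "AB", 4, 3), ("ha42a", "AB", 5, 1),
]
print(naive_frequent_users(sample))  # []

# Hypothetical user "u1", seen on 3 distinct days in each of weeks 1 to 3.
extra = [("u1", "AB", 7 * (w - 1) + d, w) for w in (1, 2, 3) for d in (1, 2, 3)]
print(naive_frequent_users(sample + extra))  # ['u1']
```

Note that with the nine sample rows shown in the question the result is empty, since no user has three days in the same week.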

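Now I need to do this in Spark in a scalable way, and I have no idea how to implement it. Also, if you have a better idea for the algorithm than my naive approach, that would really help.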

Answer 1

I suggest using the groupBy function and selecting the users with the where selector:

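from pyspark.sql.functions import count, countDistinct

# Count distinct days per user and week, keep weeks with >= 3 days,
# then keep users who have >= 3 such weeks.
df.groupBy('user_id', 'week')\
.agg(countDistinct('day').alias('days_per_week'))\
.where('days_per_week >= 3')\
.groupBy('user_id')\
.agg(count('week').alias('weeks_per_user'))\
.where('weeks_per_user >= 3')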

Answer 2

@eakotelnikov is correct.

But if anyone runs into the error

NameError: name 'countDistinct' is not defined

then run the following statement before executing eakotelnikov's solution

from pyspark.sql.functions import *

Adding another solution to this problem

# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it
tdf.createOrReplaceTempView("tbl")

outdf = spark.sql("""
select user_id , count(*) as weeks_per_user from
( -- distinct days, so repeated rows for the same day are not counted twice
  select user_id , week , count(distinct day) as days_per_week
  from tbl
  group by user_id , week
  having count(distinct day) >= 3
 ) x
group by user_id
having count(*) >= 3
""")

outdf.show()

PySpark - select users seen 3 days a week for 3 weeks a month
