Question

For example, I have a table that looks like this:

Student_Id||Index_date||logging_date||Index_date+30day
1            2017-02-11   2017-02-01    2017-03-12
1            2017-02-11   2017-02-05    2017-03-12
1            2017-02-11   2017-03-01    2017-03-12
1            2017-02-11   2017-03-02    2017-03-12
1            2017-02-11   2017-03-03    2017-03-12
1            2017-02-11   2017-03-03    2017-03-12
1            2017-02-11   2017-03-04    2017-03-12
1            2017-02-11   2017-03-05    2017-03-12
1            2017-02-11   2017-03-07    2017-03-12
1            2017-02-11   2017-03-18    2017-03-12

For each student, I want to count the rows whose logging_date falls between index_date and index_date + 30 days.

The output should be:

student_id||in_30dayscount||notin_30dayscount
1             7             2

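I tried to code this but could not find a solution.

I previously used hiveContext.sql(), but that is not allowed here.

Is it possible to do this in PySpark without using SQL?

Here is my code, which does not work: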

test2=test1.filter(col('logging_date').between('index_date','index_date+30day'))\
           .groupBy('student_id') \
           .agg(countDistinct('logging_date').alias('count')\
           .show(5)

How do I count a column while applying a filter between two date columns?
