PySpark - Get the week of the quarter from a date column

Time: 2021-03-18 18:08:15

Tags: python pandas apache-spark pyspark

I am building a Spark DataFrame with the columns date, week_of_month, week_of_quarter and week_of_year, but I don't know how to get the week of the quarter in PySpark. This is what I have so far:

from pyspark.sql import functions as f

# One row holding the first and last date of the range.
df = spark.createDataFrame([('2017-01-01', '2021-01-01')], ['start', 'end'])
df = df.withColumn('start', f.col('start').cast('date'))\
        .withColumn('end', f.col('end').cast('date'))

# Expand the range into one row per calendar day.
df2 = df.withColumn('dates', f.explode(f.expr('sequence(start, end, interval 1 day)'))).drop('start', 'end')

df_week = df2 \
    .withColumn('calwek_week_of_month', f.date_format(f.col("dates"), "W"))\
    .withColumn('calwek_week_of_year', f.weekofyear(f.col("dates")))\
    .sort(['dates'])

An example of the result I expect (I need every date between 2017-01-01 and 2021-01-01, though):

  dates         week_of_month      week_of_year    week_of_quarter
2017-01-01            1                  1                1
2017-01-02            1                  1                1
2017-01-03            1                  1                1
2017-01-04            1                  1                1
... ... ... ...
2017-03-30            5                  13               13
2017-03-31            5                  13               13
2017-04-01            1                  13               13
2017-04-02            2                  14               1
... ... ... ...
2017-04-14            3                  15               2
2017-04-15            3                  15               2
2017-04-16            4                  16               3
2017-04-17            4                  16               3

Can someone help me compute and create the week_of_quarter column in PySpark?

1 answer:

Answer 0 (score: 0)

Introduction

This post explains how to create a new column that computes the week number within a quarter.

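Business case

In many cases, business analysts want to track trends over a period that is aggregated by quarter but broken down by week within each quarter.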

Steps

Enter the following code in the formula editor:

IF(WEEKNUM(TODAY())<=13, WEEKNUM(TODAY()),
IF(WEEKNUM(TODAY())<=26, WEEKNUM(TODAY())-13,
IF(WEEKNUM(TODAY())<=39, WEEKNUM(TODAY())-26,
IF(WEEKNUM(TODAY())<=52, WEEKNUM(TODAY())-39, 0))))
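To see what the formula does to each week number, the same branches can be written as a plain Python function. This is only an illustrative sketch: `week_of_quarter` is a hypothetical helper name (not part of any library), and it assumes the input is a week-of-year number such as the one WEEKNUM returns.

```python
def week_of_quarter(week_of_year: int) -> int:
    """Mirror the formula: subtract 13 weeks for every quarter already passed."""
    if week_of_year <= 13:
        return week_of_year
    if week_of_year <= 26:
        return week_of_year - 13
    if week_of_year <= 39:
        return week_of_year - 26
    if week_of_year <= 52:
        return week_of_year - 39
    # Same fallback as the formula: anything past week 52 maps to 0.
    return 0


# Print the boundary weeks of each 13-week bucket.
for w in (1, 13, 14, 26, 27, 39, 40, 52, 53):
    print(w, week_of_quarter(w))
```

Note that, like the formula, this sketch sends week 53 to 0, which may or may not be what you want for years that have a 53rd week.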

In Spark

We will use when and otherwise to turn these conditions into Spark DataFrame column functions:

from pyspark.sql import functions as f

week = f.col("calwek_week_of_year")  # reused by every branch below

df_week = (
    given_df
    .withColumn('calwek_week_of_month', f.date_format(f.col("dates"), "W"))
    .withColumn('calwek_week_of_year', f.weekofyear(f.col("dates")))
    # Subtract 13 weeks for every quarter that has already passed.
    .withColumn("week_quarters",
                f.when(week <= 13, week)
                .when(week <= 26, week - 13)
                .when(week <= 39, week - 26)
                .when(week <= 59, week - 39)  # weekofyear never exceeds 53
                .otherwise(0))
    .sort(['dates'])
)


Please double-check the parentheses in the when/otherwise clauses; I did not verify them :)

Here is a sample output: (image not available)
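One thing worth checking against the expected table in the question: Spark's `weekofyear` follows ISO 8601 (weeks start on Monday, and week 1 is the first week with more than 3 days), so 2017-01-01 lands in week 52 rather than week 1. The expected rows look more like spreadsheet-style WEEKNUM numbering, where weeks start on Sunday and week 1 contains January 1. The pure-Python sketch below compares the two on dates from the question. It is an assumption about what the asker intended, and `weeknum_sunday` and `to_week_of_quarter` are hypothetical helper names.

```python
from datetime import date


def weeknum_sunday(d: date) -> int:
    """Week of year with Sunday-started weeks, where week 1 contains January 1."""
    jan1 = date(d.year, 1, 1)
    # isoweekday() is Monday=1 ... Sunday=7, so "% 7" gives 0 when Jan 1 is a Sunday.
    offset = jan1.isoweekday() % 7
    day_of_year = d.timetuple().tm_yday
    return (day_of_year - 1 + offset) // 7 + 1


def to_week_of_quarter(week_of_year: int) -> int:
    """Same 13-week buckets as the when/otherwise chain; weeks past 52 stay in the last bucket."""
    quarters_passed = min((week_of_year - 1) // 13, 3)
    return week_of_year - 13 * quarters_passed


# Dates taken from the expected table in the question.
samples = [date(2017, 1, 1), date(2017, 3, 31), date(2017, 4, 1),
           date(2017, 4, 2), date(2017, 4, 14), date(2017, 4, 16)]

for d in samples:
    iso_week = d.isocalendar()[1]   # ISO 8601 week number, as weekofyear uses
    weeknum = weeknum_sunday(d)     # spreadsheet-style week number
    print(d, iso_week, weeknum, to_week_of_quarter(weeknum))
```

If the Sunday-based numbering is what you need, the same arithmetic could probably be expressed in Spark as `f.floor((f.dayofyear('dates') - 1 + f.dayofweek(f.trunc('dates', 'year')) - 1) / 7) + 1` and used in place of `f.weekofyear`. I have not run that expression, so treat it as a starting point.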