Duplicate rows based on a condition with explode and arrays

Date: 2019-10-21 19:29:11

Tags: python pyspark apache-spark-sql

My goal is to count the number of deliveries still open at the end of each month. Here is my dataframe df (SOD: start of delivery, EOD: end of delivery):

Reference   Start Date   StartTimestamp  EndDate     EndTimestamp
1           2/15/2019    SOD             4/18/2019   EOD
2           2/16/2019    SOD             2/23/2019   EOD
3           2/17/2019    SOD             3/4/2019    EOD
4           3/1/2019     SOD             Null        Null
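
For reproducibility, a minimal sketch that rebuilds this sample dataframe (the column names, including the space in "Start Date", are read off the table above; dates are plain strings in M/d/yyyy format):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above; dates stay strings until converted.
df = spark.createDataFrame(
    [(1, "2/15/2019", "SOD", "4/18/2019", "EOD"),
     (2, "2/16/2019", "SOD", "2/23/2019", "EOD"),
     (3, "2/17/2019", "SOD", "3/4/2019", "EOD"),
     (4, "3/1/2019", "SOD", None, None)],
    ["Reference", "Start Date", "StartTimestamp", "EndDate", "EndTimestamp"],
)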



from pyspark.sql.functions import col, when, explode
from pyspark.sql import functions as F

# last_day() maps each start date to the last day of its month;
# a delivery is open at month end if it has not ended by then.
df1 = (df
    .withColumn("EndOfTheMonth", F.last_day("Start Date"))
    .withColumn("IsDeliveryOpen",
                when((col("Start Date") <= col("EndOfTheMonth")) &
                     ((col("EndDate") >= col("EndOfTheMonth")) | (col("EndDate").isNull())),
                     1).otherwise(0)))

Then, whenever a delivery's EndDate is later than EndOfTheMonth, I want to duplicate its row in a new dataframe, once for each month the delivery spans.

I tried to do this with explode, but I don't know how to use the function correctly:

# Non-working attempt: explode() is a column function that expects an
# array column, not a DataFrame method, so this fails as written.
df2 = (df1.filter(col("IsDeliveryOpen") == 1)
          .select("Reference").explode()
          .withColumn("EndOfTheMonth", F.add_months(F.last_day("Start Date"), 1)))

The desired output should make it possible to groupBy EndOfTheMonth (a sketch of such an aggregation follows the table below):

Reference   Start Date   StartTimestamp  EndDate     EndTimestamp EndOfTheMonth IsDeliveryOpen
1           8/15/2019    SOD             9/18/2019   EOD          8/31/2019     1
1           8/15/2019    SOD             9/18/2019   EOD          9/30/2019     0
2           8/16/2019    SOD             8/23/2019   EOD          8/31/2019     0
3           6/17/2019    SOD             8/4/2019    EOD          6/30/2019     1
3           6/17/2019    SOD             8/4/2019    EOD          7/31/2019     1
3           6/17/2019    SOD             8/4/2019    EOD          8/31/2019     0
4           8/1/2019     SOD             Null        Null         8/31/2019     1
4           8/1/2019     SOD             Null        Null         9/30/2019     1
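
For illustration, a minimal sketch of the aggregation this output enables, counting open deliveries per month end (df_expanded is a hypothetical dataframe holding the expanded rows shown above):

from pyspark.sql import functions as F

# Sum the 0/1 flag to count deliveries still open at each month end.
open_per_month = (df_expanded.groupBy("EndOfTheMonth")
                             .agg(F.sum("IsDeliveryOpen").alias("OpenDeliveries"))
                             .orderBy("EndOfTheMonth"))
open_per_month.show()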

1 Answer:

Answer 0 (score: 0)

For Spark 2.4.0+, you can use sequence + transform + explode to create the new rows for this task:

from pyspark.sql.functions import expr

df_new = (df
    # last day of the month in which the delivery starts
    .withColumn('s_date', expr("last_day(to_date(StartDate, 'M/d/yyyy'))"))
    # last day of the month in which the delivery ends; a NULL EndDate
    # (still open) falls back to the last day of the previous month
    .withColumn('e_date', expr("last_day(IFNULL(to_date(EndDate, 'M/d/yyyy'), add_months(current_date(),-1)))"))
    # array of month ends from s_date to e_date, one exploded row each
    .withColumn('EndOfTheMonth', expr('''
          explode_outer(transform(
            sequence(0, int(months_between(e_date, s_date))), i -> add_months(s_date,i)
          ))
     '''))
    .withColumn('IsDeliveryOpen', expr("IF(e_date > EndOfTheMonth or EndDate is Null, 1, 0)"))
)

df_new.show()
+---------+---------+--------------+---------+------------+----------+----------+-------------+--------------+
|Reference|StartDate|StartTimestamp|  EndDate|EndTimestamp|    s_date|    e_date|EndOfTheMonth|IsDeliveryOpen|
+---------+---------+--------------+---------+------------+----------+----------+-------------+--------------+
|        1|8/15/2019|           SOD|9/18/2019|         EOD|2019-08-31|2019-09-30|   2019-08-31|             1|
|        1|8/15/2019|           SOD|9/18/2019|         EOD|2019-08-31|2019-09-30|   2019-09-30|             0|
|        2|8/16/2019|           SOD|8/23/2019|         EOD|2019-08-31|2019-08-31|   2019-08-31|             0|
|        3|6/17/2019|           SOD| 8/4/2019|         EOD|2019-06-30|2019-08-31|   2019-06-30|             1|
|        3|6/17/2019|           SOD| 8/4/2019|         EOD|2019-06-30|2019-08-31|   2019-07-31|             1|
|        3|6/17/2019|           SOD| 8/4/2019|         EOD|2019-06-30|2019-08-31|   2019-08-31|             0|
|        4| 8/1/2019|           SOD|     null|        null|2019-08-31|2019-09-30|   2019-08-31|             1|
|        4| 8/1/2019|           SOD|     null|        null|2019-08-31|2019-09-30|   2019-09-30|             1|
+---------+---------+--------------+---------+------------+----------+----------+-------------+--------------+

df_new = df_new.drop('s_date', 'e_date')

How it works:

  1. Convert StartDate and EndDate to DateType and map each to the last day of its month (s_date and e_date). If EndDate is NULL, set it to the last_day of the month preceding current_date().

  2. Calculate the number of months between those two dates, build a sequence(0, #months), and transform it into an array of month-end dates running from StartDate to EndDate inclusive; a standalone sketch of this step follows the list.

  3. Use explode_outer to generate one row for each month end in that array (EndOfTheMonth).

  4. Calculate the IsDeliveryOpen flag accordingly. The StartDate <= EndOfTheMonth test from the question's code was dropped, since the way EndOfTheMonth is computed makes it always true.
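
To make step 2 concrete, a minimal sketch evaluating the same sequence + transform expression on a single hand-picked pair of dates (the literal dates are made up for demonstration):

from pyspark.sql.functions import expr

# A June month-end paired with an August month-end: expect three entries.
demo = spark.range(1).selectExpr(
    "to_date('2019-06-30') AS s_date",
    "to_date('2019-08-31') AS e_date",
)
demo.withColumn(
    "month_ends",
    expr("transform(sequence(0, int(months_between(e_date, s_date))), i -> add_months(s_date, i))"),
).show(truncate=False)
# month_ends: [2019-06-30, 2019-07-31, 2019-08-31]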

Note: the above can also be written as a single SQL statement:

df.createOrReplaceTempView('t_df')

spark.sql('''

    WITH d AS (
        SELECT *
             , last_day(to_date(StartDate, 'M/d/yyyy')) as s_date
             , last_day(IFNULL(to_date(EndDate, 'M/d/yyyy'),add_months(current_date(),-1))) as e_date
        FROM t_df
    )
    SELECT d.*
         , m.EndOfTheMonth
         , IF(e_date > m.EndOfTheMonth or d.EndDate is NULL,1,0) AS IsDeliveryOpen
    FROM d
    LATERAL VIEW OUTER explode(
        transform(sequence(0, int(months_between(e_date, s_date))), i -> add_months(s_date,i))
    ) m AS EndOfTheMonth

''').show()

Update for weekly ranges:

Per your comment, to do the same task at weekly granularity, adjust s_date and e_date to the Monday of their respective weeks with date_trunc('WEEK', date_col), then use the sequence() function with a step of 7 days to generate the array of dates between s_date and e_date; see the sketch below.
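
The weekly code block did not survive in this copy of the answer, so the following is a reconstruction from the description above, not the answerer's verbatim code (the column name EndOfTheWeek and the explode_outer/sequence combination are assumptions):

from pyspark.sql.functions import expr

df_weekly = (df
    # Monday of the week in which the delivery starts
    .withColumn('s_date', expr("to_date(date_trunc('WEEK', to_date(StartDate, 'M/d/yyyy')))"))
    # Monday of the week in which the delivery ends; a NULL EndDate
    # (still open) falls back to the current date
    .withColumn('e_date', expr("to_date(date_trunc('WEEK', IFNULL(to_date(EndDate, 'M/d/yyyy'), current_date())))"))
    # one exploded row per week between s_date and e_date, stepping 7 days
    .withColumn('EndOfTheWeek', expr("explode_outer(sequence(s_date, e_date, interval 7 days))"))
    .withColumn('IsDeliveryOpen', expr("IF(e_date > EndOfTheWeek or EndDate is Null, 1, 0)"))
)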