My idea is to count the number of deliveries that are still open at the end of each month. Here is my DataFrame df (SOD: start of delivery, EOD: end of delivery):
Reference  Start Date  StartTimestamp  EndDate    EndTimestamp
1          2/15/2019   SOD             4/18/2019  EOD
2          2/16/2019   SOD             2/23/2019  EOD
3          2/17/2019   SOD             3/4/2019   EOD
4          3/1/2019    SOD             Null       Null
from pyspark.sql.functions import col, when, explode
from pyspark.sql import functions as F

df1 = (df
    .withColumn("EndOfTheMonth", F.last_day(col("Start Date")))
    .withColumn("IsDeliveryOpen",
        when((col("Start Date") <= col("EndOfTheMonth")) &
             ((col("EndDate") >= col("EndOfTheMonth")) | col("EndDate").isNull()), 1)
        .otherwise(0)))
To do this, I want to duplicate each row into a new DataFrame, once per month, whenever the delivery's EndDate is later than EndOfTheMonth. I tried to do it with explode, but I don't know how to use this function:
df2 = (df1.filter(col("IsDeliveryOpen") == 1)
    .select("Reference").explode()
    .withColumn("EndOfTheMonth", F.add_months(F.last_day("Start Date"), 1)))
The desired output should make it possible to groupBy on EndOfTheMonth:
Reference  Start Date  StartTimestamp  EndDate    EndTimestamp  EndOfTheMonth  IsDeliveryOpen
1          8/15/2019   SOD             9/18/2019  EOD           8/31/2019      1
1          8/15/2019   SOD             9/18/2019  EOD           9/30/2019      0
2          8/16/2019   SOD             8/23/2019  EOD           8/31/2019      0
3          6/17/2019   SOD             8/4/2019   EOD           6/30/2019      1
3          6/17/2019   SOD             8/4/2019   EOD           7/31/2019      1
3          6/17/2019   SOD             8/4/2019   EOD           8/31/2019      0
4          8/1/2019    SOD             Null       Null          8/31/2019      1
4          8/1/2019    SOD             Null       Null          9/30/2019      1
Answer (score 0):
For Spark 2.4.0+, you can use sequence + transform + explode to create the new rows for this task.
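First, a minimal sketch to reproduce the sample data (this setup block is my assumption, not part of the original post; dates stay as strings and the start column is named StartDate, without a space, to match the show() output below):

# Hypothetical setup: spark is an existing SparkSession
df = spark.createDataFrame([
    (1, "8/15/2019", "SOD", "9/18/2019", "EOD"),
    (2, "8/16/2019", "SOD", "8/23/2019", "EOD"),
    (3, "6/17/2019", "SOD", "8/4/2019", "EOD"),
    (4, "8/1/2019", "SOD", None, None),
], ["Reference", "StartDate", "StartTimestamp", "EndDate", "EndTimestamp"])

With that in place, the transformation is: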
from pyspark.sql.functions import expr

df_new = (df
    # last day of the start month
    .withColumn('s_date', expr("last_day(to_date(StartDate, 'M/d/yyyy'))"))
    # last day of the end month; a NULL EndDate falls back to one month before today
    .withColumn('e_date', expr("last_day(IFNULL(to_date(EndDate, 'M/d/yyyy'), add_months(current_date(),-1)))"))
    # one row per month-end between s_date and e_date (inclusive)
    .withColumn('EndOfTheMonth', expr('''
        explode_outer(transform(
            sequence(0, int(months_between(e_date, s_date))), i -> add_months(s_date,i)
        ))
    '''))
    # open if the delivery ends after this month-end, or has no EndDate yet
    .withColumn('IsDeliveryOpen', expr("IF(e_date > EndOfTheMonth or EndDate is Null, 1, 0)"))
)
df_new.show()
+---------+---------+--------------+---------+------------+----------+----------+-------------+--------------+
|Reference|StartDate|StartTimestamp| EndDate|EndTimestamp| s_date| e_date|EndOfTheMonth|IsDeliveryOpen|
+---------+---------+--------------+---------+------------+----------+----------+-------------+--------------+
| 1|8/15/2019| SOD|9/18/2019| EOD|2019-08-31|2019-09-30| 2019-08-31| 1|
| 1|8/15/2019| SOD|9/18/2019| EOD|2019-08-31|2019-09-30| 2019-09-30| 0|
| 2|8/16/2019| SOD|8/23/2019| EOD|2019-08-31|2019-08-31| 2019-08-31| 0|
| 3|6/17/2019| SOD| 8/4/2019| EOD|2019-06-30|2019-08-31| 2019-06-30| 1|
| 3|6/17/2019| SOD| 8/4/2019| EOD|2019-06-30|2019-08-31| 2019-07-31| 1|
| 3|6/17/2019| SOD| 8/4/2019| EOD|2019-06-30|2019-08-31| 2019-08-31| 0|
| 4| 8/1/2019| SOD| null| null|2019-08-31|2019-09-30| 2019-08-31| 1|
| 4| 8/1/2019| SOD| null| null|2019-08-31|2019-09-30| 2019-09-30| 1|
+---------+---------+--------------+---------+------------+----------+----------+-------------+--------------+
df_new = df_new.drop('s_date', 'e_date')
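From here, the monthly count of open deliveries asked for in the question is a single aggregation (a sketch; open_deliveries is my naming, not from the original post):

from pyspark.sql import functions as F

# one row per month-end with the number of deliveries still open
(df_new.groupBy("EndOfTheMonth")
    .agg(F.sum("IsDeliveryOpen").alias("open_deliveries"))
    .orderBy("EndOfTheMonth")
    .show())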
How it works:
- Convert StartDate and EndDate into DateType and shift each value to the last day of its own month (s_date, e_date). If EndDate is NULL, set it to one month before current_date().
- Count the number of months between the two dates, create sequence(0, #months), and transform it into an array of month-end dates between StartDate and EndDate, inclusive (see the small demo after this list).
- Use explode_outer to generate one row for each month-end (EndOfTheMonth) in that array.
- Compute the IsDeliveryOpen flag accordingly. I removed the StartDate <= EndOfTheMonth condition from your code, since by the way EndOfTheMonth is calculated it is always true.
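To see what the sequence + transform expression produces, here is a small standalone demo for Reference 3; on Spark 2.4, add_months keeps a last-day-of-month input on the last day of the resulting month, so the month-ends match the output table above:

# months_between gives 2, sequence(0, 2) gives [0, 1, 2], and add_months
# turns each offset into a month-end date (Spark 2.4 semantics)
spark.sql("""
    SELECT transform(
        sequence(0, int(months_between(to_date('2019-08-31'), to_date('2019-06-30')))),
        i -> add_months(to_date('2019-06-30'), i)
    ) AS month_ends
""").show(truncate=False)
# [2019-06-30, 2019-07-31, 2019-08-31]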
Note: the above can also be written as a single SQL statement:
df.createOrReplaceTempView('t_df')
spark.sql('''
WITH d AS (
SELECT *
, last_day(to_date(StartDate, 'M/d/yyyy')) as s_date
, last_day(IFNULL(to_date(EndDate, 'M/d/yyyy'),add_months(current_date(),-1))) as e_date
FROM t_df
)
SELECT d.*
, m.EndOfTheMonth
, IF(e_date > m.EndOfTheMonth or d.EndDate is NULL,1,0) AS IsDeliveryOpen
FROM d
LATERAL VIEW OUTER explode(
transform(sequence(0, int(months_between(e_date, s_date))), i -> add_months(s_date,i))
) m AS EndOfTheMonth
''').show()
Per your comment, to do this weekly instead of monthly, you can adjust s_date and e_date to the Monday of their respective weeks using date_trunc('WEEK', date_col), and then use the sequence() function with a 7-day step to generate an array of dates between s_date and e_date; see the code below:
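A minimal sketch of that weekly variant, under the same assumptions as above (df_weekly and StartOfTheWeek are my names, not from the original post):

from pyspark.sql.functions import expr

df_weekly = (df
    # Monday of the start week; date_trunc returns a timestamp, hence to_date()
    .withColumn('s_date', expr("to_date(date_trunc('WEEK', to_date(StartDate, 'M/d/yyyy')))"))
    # Monday of the end week; a NULL EndDate falls back to one week before today
    .withColumn('e_date', expr("to_date(date_trunc('WEEK', IFNULL(to_date(EndDate, 'M/d/yyyy'), date_sub(current_date(), 7))))"))
    # one row per week between s_date and e_date, stepping 7 days at a time
    .withColumn('StartOfTheWeek', expr("explode_outer(sequence(s_date, e_date, interval 7 days))"))
    .withColumn('IsDeliveryOpen', expr("IF(e_date > StartOfTheWeek or EndDate is NULL, 1, 0)"))
    .drop('s_date', 'e_date')
)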