I have the data frame below, containing Year, Month, and Weeks columns. I need to create a date column (as shown below) from the Year, Month, and Weeks columns, treating Friday as the week-ending date.
Year  Month  Weeks  date
2018  April  01 W   2018-04-06
2018  April  02 W   2018-04-13
2018  April  03 W   2018-04-20
2018  April  04 W   2018-04-27
2018  May    01 W   2018-05-04
2018  May    02 W   2018-05-11
2018  May    03 W   2018-05-18
2018  May    04 W   2018-05-25
2018  June   01 W   2018-06-01
Could someone please suggest how to achieve this in PySpark?
Answer 0 (score: 3)
You can do this without any udf. The logic is as follows: first, build a DateType() column from the Year and Month columns using concat and to_date, with the day fixed to the first of the month (lit("01")). Then truncate that date with date_trunc, passing "week" as the format parameter; this returns the Monday at the start of that week, i.e. the most recent Monday on or before the date. Finally, add 7 times the number in the Weeks column, plus 4 days, to land on the desired Friday. There is one edge case to handle: sometimes the truncated date + 4 days still falls in the previous month (for example, April week 01 W truncates to 2018-03-26, and adding 4 days only reaches 2018-03-30), in which case we need to add an extra 7 days. A quick sanity check of the truncation behaviour is sketched after the example output below. In code:
from pyspark.sql.functions import col, concat, date_add, date_trunc
from pyspark.sql.functions import expr, lit, month, substring, to_date, when

def truncate_date(year, month):
    """Assumes year and month are columns"""
    dt = concat(year, month, lit("01"))
    return date_trunc("week", to_date(dt, "yyyyMMMdd")).cast("date")

def get_days_to_add(truncated_date, weeks):
    """If the truncated date + 4 days is in the same month,
    we need to skip ahead one extra week"""
    return when(
        month(date_add(truncated_date, 4)) == month(truncated_date),
        (substring(weeks, 1, 2).cast("int"))*7 + 4
    ).otherwise((substring(weeks, 1, 2).cast("int")-1)*7 + 4)

df.withColumn("truncated_date", truncate_date(col("Year"), col("Month")))\
    .withColumn("days_to_add", get_days_to_add(col("truncated_date"), col("Weeks")))\
    .withColumn("final_date", expr("date_add(truncated_date, days_to_add)"))\
    .show()
#+----+-----+-----+----------+--------------+-----------+----------+
#|Year|Month|Weeks|      date|truncated_date|days_to_add|final_date|
#+----+-----+-----+----------+--------------+-----------+----------+
#|2018|April|  01W|2018-04-06|    2018-03-26|         11|2018-04-06|
#|2018|April|  02W|2018-04-13|    2018-03-26|         18|2018-04-13|
#|2018|April|  03W|2018-04-20|    2018-03-26|         25|2018-04-20|
#|2018|April|  04W|2018-04-27|    2018-03-26|         32|2018-04-27|
#|2018|  May|  01W|2018-05-04|    2018-04-30|          4|2018-05-04|
#|2018|  May|  02W|2018-05-11|    2018-04-30|         11|2018-05-11|
#|2018|  May|  03W|2018-05-18|    2018-04-30|         18|2018-05-18|
#|2018|  May|  04W|2018-05-25|    2018-04-30|         25|2018-05-25|
#|2018| June|  01W|2018-06-01|    2018-05-28|          4|2018-06-01|
#+----+-----+-----+----------+--------------+-----------+----------+
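As a quick sanity check of the truncation step described above, here is a minimal sketch (assuming a running SparkSession available as spark; the demo DataFrame and its column name are purely illustrative) showing that date_trunc("week", ...) snaps a date back to the preceding Monday:

from pyspark.sql.functions import col, date_trunc, to_date

demo = spark.createDataFrame([("2018-04-01",)], ["d"])  # 2018-04-01 was a Sunday
demo.select(
    date_trunc("week", to_date(col("d"))).cast("date").alias("week_start")
).show()
# week_start should come out as 2018-03-26, the Monday of that week.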
You can drop the intermediate columns, but I have left them in to illustrate the logic and the steps.
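For example, a minimal sketch of the same pipeline with the helper columns removed, simply chaining DataFrame.drop at the end (nothing else changes):

df.withColumn("truncated_date", truncate_date(col("Year"), col("Month")))\
    .withColumn("days_to_add", get_days_to_add(col("truncated_date"), col("Weeks")))\
    .withColumn("final_date", expr("date_add(truncated_date, days_to_add)"))\
    .drop("truncated_date", "days_to_add")\
    .show()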
Answer 1 (score: 1)
Here is one way to solve this:
from datetime import datetime
from datetime import timedelta

import pandas as pd

from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *

df = spark.createDataFrame([(2018, 'April', '01 W'),
                            (2018, 'April', '02 W'),
                            (2018, 'April', '03 W'),
                            (2018, 'April', '04 W'),
                            (2018, 'May', '01 W'),
                            (2018, 'May', '02 W'),
                            (2018, 'May', '03 W'),
                            (2018, 'May', '04 W'),
                            (2018, 'June', '01 W')
                            ],
                           ["Year", "Month", "Weeks"])

# Extract the numeric week number from the 'Weeks' column, e.g. '01 W' -> 1
df = df.withColumn('week_number', F.regexp_extract(df['Weeks'], r'(\d+) ', 1).cast(IntegerType()))

# Map the month name to its two-digit month number
md = {'April': '04', 'May': '05', 'June': '06'}
df = df.withColumn('month_number', F.udf(lambda r: md[r])(df['Month']))

# Find the first Friday of each month: parse 'yyyy-MM', step back one day,
# then jump forward to the next Friday
df = df.withColumn('yyyymm', F.concat_ws('-', df['Year'], df['month_number']))
df = df.withColumn('first_date', F.to_date(df['yyyymm'], 'yyyy-MM'))
df = df.withColumn('first_date', F.date_sub(df['first_date'], 1))
df = df.withColumn('first_date', F.next_day(df['first_date'], 'Fri'))
df = df.withColumn('date', F.lit(''))
df.show()

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def _calc_fri(pdf):
    # Shift the month's first Friday forward by (week_number - 1) weeks
    s = pd.to_datetime(pdf['first_date'], format='%Y-%m-%d')
    days = s + pd.to_timedelta((pdf['week_number'] - 1) * 7, unit='day')
    pdf['date'] = days.dt.strftime("%Y-%m-%d")
    return pdf

df = df.groupby(['Year', 'Month']).apply(_calc_fri).orderBy(['Year', 'month_number', 'week_number'])
df.show()
Output:
+----+-----+-----+-----------+------------+-------+----------+----------+
|Year|Month|Weeks|week_number|month_number| yyyymm|first_date|      date|
+----+-----+-----+-----------+------------+-------+----------+----------+
|2018|April| 01 W|          1|          04|2018-04|2018-04-06|2018-04-06|
|2018|April| 02 W|          2|          04|2018-04|2018-04-06|2018-04-13|
|2018|April| 03 W|          3|          04|2018-04|2018-04-06|2018-04-20|
|2018|April| 04 W|          4|          04|2018-04|2018-04-06|2018-04-27|
|2018|  May| 01 W|          1|          05|2018-05|2018-05-04|2018-05-04|
|2018|  May| 02 W|          2|          05|2018-05|2018-05-04|2018-05-11|
|2018|  May| 03 W|          3|          05|2018-05|2018-05-04|2018-05-18|
|2018|  May| 04 W|          4|          05|2018-05|2018-05-04|2018-05-25|
|2018| June| 01 W|          1|          06|2018-06|2018-06-01|2018-06-01|
+----+-----+-----+-----------+------------+-------+----------+----------+
I suppose you could also push all of the work into the pandas_udf, or use a udf; personally, I would try to keep the amount of work done inside any udf to a minimum.
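For instance, one way to get rid of the month-name udf is a literal map lookup; a minimal sketch (not part of the original answer, assuming Spark 2.4+ so that element_at is available):

from itertools import chain
from pyspark.sql import functions as F

# Build a literal map column from month name to month number and look up the
# Month column in it, replacing the Python udf used above. Only the three
# months present in the sample data are covered.
md = {'April': '04', 'May': '05', 'June': '06'}
month_map = F.create_map(*chain.from_iterable((F.lit(k), F.lit(v)) for k, v in md.items()))
df = df.withColumn('month_number', F.element_at(month_map, F.col('Month')))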