Converting year and week number to week_start and week_end dates in PySpark

Date: 2020-07-22 09:00:44

Tags: python pyspark apache-spark-sql pyspark-dataframes

I am trying to convert a year and week number into week start and week end dates using a PySpark DataFrame.


I have a DataFrame with a `year_week` column:

+---------+
|year_week|
+---------+
| 2019-W51|
| 2019-W52|
| 2020-W01|
| 2020-W02|
| 2020-W03|
| 2020-W04|
| 2020-W05|
| 2020-W06|
| 2020-W07|
+---------+

When I try to apply the following code, I get an error saying the Column object is not callable:

df = df.withColumn('week_start_date', df.year_week.apply(lambda x: datetime.datetime.strptime(x + '-1', "%Y-W%W-%w")))

Error:

TypeError: 'Column' object is not callable

The expected result is a week start date and a week end date for each row.
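As an aside on the error itself: a Spark `Column` has no `.apply` method (that is a pandas idiom), which is what raises the `TypeError` before the lambda ever runs. The `strptime` call inside the lambda is plain Python and does work on strings, but note that `%Y-W%W-%w` numbers weeks from the first Monday of the year, which does not match ISO-style values such as `2020-W01` mapping to `2019-12-30`. A minimal pure-Python sketch using the ISO directives `%G` (ISO year), `%V` (ISO week), `%u` (ISO weekday, 1 = Monday), which are supported since Python 3.6:

```python
from datetime import datetime

# Append "-1" to select Monday of the ISO week, then parse with the
# ISO week-date directives rather than %Y/%W/%w.
week_start = datetime.strptime("2020-W01" + "-1", "%G-W%V-%u").date()
print(week_start)  # 2019-12-30
```

This matches the output table in the answers below, where 2020-W01 starts on 2019-12-30.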

2 Answers:

Answer 0 (score: 1)

Try this -

// the week starts on Monday, so concat "-1"; use "-2" for Tuesday, etc.
val p = df2.withColumn("week_start", to_date(concat($"year_week", lit("-1")), "YYYY-'W'ww-u"))
  .withColumn("week_end", next_day($"week_start", "SUN"))
p.show(false)
p.printSchema()

    /**
      * +---------+----------+----------+
      * |year_week|week_start|week_end  |
      * +---------+----------+----------+
      * |2019-W51 |2019-12-16|2019-12-22|
      * |2019-W52 |2019-12-23|2019-12-29|
      * |2020-W01 |2019-12-30|2020-01-05|
      * |2020-W02 |2020-01-06|2020-01-12|
      * |2020-W03 |2020-01-13|2020-01-19|
      * |2020-W04 |2020-01-20|2020-01-26|
      * |2020-W05 |2020-01-27|2020-02-02|
      * |2020-W06 |2020-02-03|2020-02-09|
      * |2020-W07 |2020-02-10|2020-02-16|
      * +---------+----------+----------+
      *
      * root
      * |-- year_week: string (nullable = true)
      * |-- week_start: date (nullable = true)
      * |-- week_end: date (nullable = true)
      */

Answer 1 (score: 1)

Building on Someshwar's answer, the Scala only needs minor changes for Python:

from pyspark.sql import functions as F

df1 = (df.withColumn("week_start", F.to_date(F.concat(F.col("year_week"), F.lit("-1")), "YYYY-'W'ww-u"))
         .withColumn("week_end", F.next_day(F.col("week_start"), "SUN")))
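The date arithmetic in both answers can be cross-checked without a Spark session. A pure-Python sketch, assuming ISO week semantics (which is what the output table above reflects), where the week end is simply six days after a Monday start, the same date Spark's `next_day(week_start, "SUN")` produces:

```python
from datetime import datetime, timedelta

def week_bounds(year_week: str):
    """Return (week_start, week_end) dates for a string like '2020-W01'."""
    # "-1" selects Monday; %G/%V/%u are the ISO week-date directives.
    start = datetime.strptime(year_week + "-1", "%G-W%V-%u").date()
    # For a Monday start, the following Sunday is six days later.
    return start, start + timedelta(days=6)

s, e = week_bounds("2019-W51")
print(s, e)  # 2019-12-16 2019-12-22
```

These values agree with the first row of the Scala answer's output table.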