Most recent Monday in Spark

Date: 2016-10-26 20:45:09

Tags: python apache-spark pyspark apache-spark-sql pyspark-sql

I am using Spark 2.0 with the Python API.

I have a DataFrame with a column of type DateType(). I would like to add a column containing the most recent Monday to the DataFrame.

I can do it like this:

import pyspark.sql.functions
import pyspark.sql.types

reg_schema = pyspark.sql.types.StructType([
    pyspark.sql.types.StructField('AccountCreationDate', pyspark.sql.types.DateType(), True),
    pyspark.sql.types.StructField('UserId', pyspark.sql.types.LongType(), True)
])
reg = spark.read.schema(reg_schema).option('header', True).csv(path_to_file)
reg = reg.withColumn('monday',
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate,'E') == 'Mon',
        reg.AccountCreationDate).otherwise(
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate,'E') == 'Tue',
        pyspark.sql.functions.date_sub(reg.AccountCreationDate, 1)).otherwise(
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate, 'E') == 'Wed',
        pyspark.sql.functions.date_sub(reg.AccountCreationDate, 2)).otherwise(
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate, 'E') == 'Thu',
        pyspark.sql.functions.date_sub(reg.AccountCreationDate, 3)).otherwise(
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate, 'E') == 'Fri',
        pyspark.sql.functions.date_sub(reg.AccountCreationDate, 4)).otherwise(
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate, 'E') == 'Sat',
        pyspark.sql.functions.date_sub(reg.AccountCreationDate, 5)).otherwise(
    pyspark.sql.functions.when(pyspark.sql.functions.date_format(reg.AccountCreationDate, 'E') == 'Sun',
        pyspark.sql.functions.date_sub(reg.AccountCreationDate, 6))
        )))))))

However, that seems like a lot of code for something that should be fairly simple. Is there a more concise way to do this?
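For reference, the underlying date arithmetic is simple: Python's `datetime.date.weekday()` numbers Monday as 0, so the most recent Monday is just the date minus its weekday offset. A plain-Python sketch (outside Spark) of what the chain above computes:

```python
from datetime import date, timedelta

def last_monday(d: date) -> date:
    # weekday() is 0 for Monday ... 6 for Sunday, so this is the number
    # of days to step back to reach the most recent Monday.
    return d - timedelta(days=d.weekday())

print(last_monday(date(2016, 10, 26)))  # a Wednesday -> 2016-10-24
```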

3 Answers:

Answer 0 (score: 5)

You can use next_day to find the following Monday and then subtract one week. The required functions can be imported as follows:

from pyspark.sql.functions import next_day, date_sub

and define:

def previous_day(date, dayOfWeek):
    return date_sub(next_day(date, dayOfWeek), 7)

Finally, an example:

from pyspark.sql.functions import to_date

df = sc.parallelize([
    ("2016-10-26", )
]).toDF(["date"]).withColumn("date", to_date("date"))

df.withColumn("last_monday", previous_day("date", "monday")).show()

The result:

+----------+-----------+
|      date|last_monday|
+----------+-----------+
|2016-10-26| 2016-10-24|
+----------+-----------+
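The next_day-minus-a-week trick relies on next_day returning the first matching day strictly *after* the given date, so subtracting 7 days maps a Monday to itself. A plain-Python sketch of the same logic, for sanity-checking outside Spark:

```python
from datetime import date, timedelta

def next_monday(d: date) -> date:
    # First Monday strictly after d (mirrors Spark's next_day semantics).
    days_ahead = (7 - d.weekday()) % 7
    return d + timedelta(days=days_ahead or 7)

def previous_monday(d: date) -> date:
    # Next Monday minus one week is the most recent Monday on or before d.
    return next_monday(d) - timedelta(days=7)

print(previous_monday(date(2016, 10, 26)))  # 2016-10-24
```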

Answer 1 (score: 0)

I found that pyspark's function trunc also works: trunc(col, 'week') truncates a date to the Monday of its week (note that 'week' support requires a sufficiently recent Spark version; the related date_trunc function with 'week' is available from Spark 2.3).

Answer 2 (score: 0)

import pyspark.sql.functions as f

# dayofweek numbers Sunday=1, Monday=2, ..., Saturday=7, so
# (dayofweek + 5) % 7 is the number of days since the last Monday.
# (The simpler dayofweek - 2 would be -1 on Sundays and pick the *next* Monday.)
df = df.withColumn('days_from_monday', (f.dayofweek(f.col('transaction_timestamp')) + 5) % 7)
df = df.withColumn('transaction_week_start_date', f.expr("date_sub(transaction_timestamp, days_from_monday)"))
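Since Spark's dayofweek numbers Sunday as 1 through Saturday as 7, the expression (dayofweek + 5) % 7 coincides exactly with Python's weekday() (Monday = 0). A quick plain-Python check of that identity across all seven days of a week:

```python
from datetime import date, timedelta

start = date(2016, 10, 24)  # a Monday
for i in range(7):
    d = start + timedelta(days=i)
    # Reproduce Spark's dayofweek numbering: Sunday=1 ... Saturday=7.
    spark_dow = (d.weekday() + 1) % 7 + 1
    assert (spark_dow + 5) % 7 == d.weekday()
print("days_from_monday matches weekday() for all seven days")
```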