How can I pull data by looping over dates in PySpark SQL?

Asked: 2019-06-13 20:18:54

Tags: pyspark apache-spark-sql pyspark-sql

I have a script that uses Spark SQL to pull data into a PySpark DataFrame. It looks like this:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession  # HiveContext lives in pyspark.sql and is not needed once enableHiveSupport() is used

# Hive-enabled session so spark.sql() can query Hive tables
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df_query = """
select 
  *
from schema.table
where start_date between date '2019-03-01' and date '2019-03-07'
"""
df = spark.sql(df_query) 

Currently the script pulls data for one specific week. However, I want to iterate it over every week in a date range. How can I do that?

1 Answer:

Answer 0 (score: 1)

You can use the timedelta class:

import datetime

# inclusive bounds of the overall range to sweep over, week by week
startDate = datetime.datetime.strptime('2019-03-01', "%Y-%m-%d")
maxDate = datetime.datetime.strptime('2019-04-03', "%Y-%m-%d")


while startDate <= maxDate:
    # BETWEEN is inclusive, so a 7-day week ends 6 days after it starts
    endDate = startDate + datetime.timedelta(days=6)

    df_query = """
select
  *
from schema.table
where start_date between date '{}' and date '{}'
""".format(startDate.date(), endDate.date())
    print(df_query)

    # the next week starts the day after the current one ends
    startDate = endDate + datetime.timedelta(days=1)
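If each weekly slice should also be executed and collected into one DataFrame, a minimal sketch along these lines could work. It reuses schema.table and start_date from the question; the helper name weekly_frames and the choice of unionByName to stack the weekly results are assumptions, not part of the original answer.

import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def weekly_frames(first_day, last_day):
    # yield one DataFrame per 7-day window between first_day and last_day (inclusive)
    start = first_day
    while start <= last_day:
        end = start + datetime.timedelta(days=6)
        query = """
select *
from schema.table
where start_date between date '{}' and date '{}'
""".format(start, end)
        yield spark.sql(query)
        start = end + datetime.timedelta(days=1)

# run every weekly query and stack the results into a single DataFrame
frames = list(weekly_frames(datetime.date(2019, 3, 1), datetime.date(2019, 4, 3)))
result = frames[0]
for weekly_df in frames[1:]:
    result = result.unionByName(weekly_df)

Note that if the weeks are contiguous and nothing is done per week, a single query over the whole range returns the same rows; looping is mainly useful when each week needs to be processed or written out separately.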