SparkSQL on pyspark: how to generate time series?

Date: 2017-03-31 13:14:19

Tags: python-2.7 pyspark time-series apache-spark-sql pyspark-sql

I'm using SparkSQL on pyspark to store some PostgreSQL tables into DataFrames, and then build a query that generates several time series based on the start and stop columns, which are of type date.

Suppose that my_table contains:

 start      | stop       
-------------------------
 2000-01-01 | 2000-01-05 
 2012-03-20 | 2012-03-23 

In PostgreSQL, this is very easy to do:

SELECT generate_series(start, stop, '1 day'::interval)::date AS date FROM my_table

and it will generate this table:

 date       
------------
 2000-01-01 
 2000-01-02 
 2000-01-03 
 2000-01-04 
 2000-01-05 
 2012-03-20 
 2012-03-21 
 2012-03-22 
 2012-03-23 

But how can this be done in plain SparkSQL? Is it necessary to use a UDF or some DataFrame methods?

5 Answers:

Answer 0 (score: 10):

@Rakesh's answer is correct, but I would like to share a less verbose solution:

import datetime
from pyspark.sql.types import ArrayType, DateType

# UDF: build the list of dates from start to stop, inclusive
def generate_date_series(start, stop):
    return [start + datetime.timedelta(days=x) for x in range(0, (stop - start).days + 1)]

# Register the UDF for later use in SQL
spark.udf.register("generate_date_series", generate_date_series, ArrayType(DateType()))

# mydf is a DataFrame with columns `start` and `stop` of type DateType()
mydf.createOrReplaceTempView("mydf")

spark.sql("SELECT explode(generate_date_series(start, stop)) FROM mydf").show()

Answer 1 (score: 2):

EDIT
This creates a DataFrame with a single row containing an array of consecutive dates:

from pyspark.sql.functions import sequence, to_date, explode, col

spark.sql("SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month) as date")

+------------------------------------------+
|                  date                    |
+------------------------------------------+
| ["2018-01-01","2018-02-01","2018-03-01"] |
+------------------------------------------+

You can use the explode function to "pivot" that array into rows:

spark.sql("SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month) as date").withColumn("date", explode(col("date"))

+------------+
|    date    |
+------------+
| 2018-01-01 |
| 2018-02-01 |
| 2018-03-01 |
+------------+

(end of edit)

Spark v2.4 supports the sequence function:

sequence(start, stop, step) - Generates an array of elements from start to stop (inclusive), incrementing by step. The type of the returned elements is the same as the type of the argument expressions.

Supported types are: byte, short, integer, long, date, timestamp.

Examples:

SELECT sequence(1, 5);
[1,2,3,4,5]

SELECT sequence(5, 1);
[5,4,3,2,1]

SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month);
[2018-01-01,2018-02-01,2018-03-01]

https://docs.databricks.com/spark/latest/spark-sql/language-manual/functions.html#sequence
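Applied to the question's table, a minimal sketch (assuming Spark 2.4+ and a temp view or DataFrame mydf with DateType columns start and stop, as in Answer 0) would be:

from pyspark.sql.functions import expr

# One row per day between start and stop, inclusive (requires Spark >= 2.4)
mydf.select(expr("explode(sequence(start, stop, interval 1 day)) AS date")).show()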

Answer 2 (score: 1):

The existing answers will work, but they are very inefficient. Instead, it is better to use range and then cast the data. In Python:

from pyspark.sql.functions import col
from pyspark.sql import SparkSession

def generate_series(start, stop, interval):
    """
    :param start: lower bound, inclusive
    :param stop: upper bound, exclusive
    :param interval: increment interval in seconds
    """
    spark = SparkSession.builder.getOrCreate()
    # Determine start and stop in epoch seconds
    start, stop = spark.createDataFrame(
        [(start, stop)], ("start", "stop")
    ).select(
        [col(c).cast("timestamp").cast("long") for c in ("start", "stop")]
    ).first()
    # Create range with increments and cast to timestamp
    return spark.range(start, stop, interval).select(
        col("id").cast("timestamp").alias("value")
    )

Usage examples:

generate_series("2000-01-01", "2000-01-05", 60 * 60).show(5)  # By hour
+-------------------+
|              value|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-01 01:00:00|
|2000-01-01 02:00:00|
|2000-01-01 03:00:00|
|2000-01-01 04:00:00|
+-------------------+
only showing top 5 rows

generate_series("2000-01-01", "2000-01-05", 60 * 60 * 24).show()  # By day
+-------------------+
|              value|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-02 00:00:00|
|2000-01-03 00:00:00|
|2000-01-04 00:00:00|
+-------------------+
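If plain DateType values are wanted instead of timestamps (an assumption about the desired output, not part of the original answer), a cast on the result is enough:

from pyspark.sql.functions import col

# Reuse generate_series from above; cast the timestamps down to dates
generate_series("2000-01-01", "2000-01-05", 60 * 60 * 24) \
    .select(col("value").cast("date").alias("date")).show()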

Answer 3 (score: 0):

Assuming you have a DataFrame df from Spark SQL, try this:

import datetime
import pyspark.sql.functions as F
import pyspark.sql.types as T

def timeseriesDF(start, total):
    # Build a list of `total` consecutive dates starting at `start`.
    # timedelta replaces F.date_add here: date_add is a Column function
    # and cannot run on plain Python dates inside a UDF.
    series = [start]
    for i in xrange(total - 1):  # use range() on Python 3
        series.append(series[-1] + datetime.timedelta(days=1))
    return series

df.withColumn("t_series", F.udf(
                timeseriesDF,
                T.ArrayType(T.DateType())
            )(df.start, F.datediff(df.stop, df.start) + 1)
    ).select(F.explode("t_series")).show()

Answer 4 (score: 0):

Building on user10938362's answer, this just shows a way to use range without a UDF, provided that you are trying to build a DataFrame of dates based on some ingested dataset, rather than from a hardcoded start/stop.

# start date is the min date in the dataset (assumes the `date` column holds unix seconds)
date_min = int(df.agg({'date': 'min'}).first()[0])
# end date is the current time; alternatively, use max as above
date_max = (
    spark.sql('select unix_timestamp(current_timestamp()) as date_max')
    .collect()[0]['date_max']
)
# range works on ints; unix time is in seconds, so 60*60*24 = one day
df = spark.range(date_min, date_max, 60*60*24).select('id')
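As a hypothetical follow-up (not part of the original answer), the epoch seconds in id can be cast back to dates:

from pyspark.sql.functions import col

# Convert the epoch-second ids produced by range() into DateType values
df = df.select(col('id').cast('timestamp').cast('date').alias('date'))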