I'm using SparkSQL on pyspark to store some PostgreSQL tables into DataFrames, and then I build a query that generates several time series based on the start and stop columns, both of type date.
Suppose my_table contains:
start | stop
-------------------------
2000-01-01 | 2000-01-05
2012-03-20 | 2012-03-23
In PostgreSQL this is easy to do with generate_series, which expands each row into one row per date between start and stop, inclusive. But how can this be done using plain SparkSQL? Is it necessary to use a UDF or some DataFrame method?
Answer 0 (score: 10)
@Rakesh's answer is correct, but I would like to share a less verbose solution:
import datetime
from pyspark.sql.types import ArrayType, DateType

# UDF: every date from start to stop, inclusive
def generate_date_series(start, stop):
    return [start + datetime.timedelta(days=x) for x in range(0, (stop - start).days + 1)]

# Register the UDF so it can be called from SQL
spark.udf.register("generate_date_series", generate_date_series, ArrayType(DateType()))

# mydf is a DataFrame with columns `start` and `stop` of type DateType()
mydf.createOrReplaceTempView("mydf")

spark.sql("SELECT explode(generate_date_series(start, stop)) FROM mydf").show()
Answer 1 (score: 2)
EDIT
This creates a DataFrame with a single row containing an array of consecutive dates:
from pyspark.sql.functions import sequence, to_date, explode, col
spark.sql("SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month) as date")
+------------------------------------------+
| date |
+------------------------------------------+
| ["2018-01-01","2018-02-01","2018-03-01"] |
+------------------------------------------+
You can use the explode function to "pivot" that array into rows:
spark.sql("SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month) as date").withColumn("date", explode(col("date"))
+------------+
| date |
+------------+
| 2018-01-01 |
| 2018-02-01 |
| 2018-03-01 |
+------------+
(End of edit)
Spark v2.4 supports the sequence function:
sequence(start, stop, step) - Generates an array of elements from start to stop (inclusive), incrementing by step. The type of the returned elements is the same as the type of the argument expressions.
Supported types are: byte, short, integer, long, date, timestamp.
Examples:
SELECT sequence(1, 5);
[1,2,3,4,5]
SELECT sequence(5, 1);
[5,4,3,2,1]
SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month);
[2018-01-01,2018-02-01,2018-03-01]
https://docs.databricks.com/spark/latest/spark-sql/language-manual/functions.html#sequence
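Applied to a DataFrame with the question's start and stop columns, the same sequence function solves the original problem in one expression. The snippet below is a sketch of mine, assuming Spark 2.4+ and a DataFrame named mydf as in the previous answer:

from pyspark.sql.functions import expr, explode

# Sketch: one output row per day between start and stop, inclusive on both ends
mydf.select(explode(expr("sequence(start, stop, interval 1 day)")).alias("date")).show()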
Answer 2 (score: 1)
The existing answers will work, but they are very inefficient. Instead, it is better to use range and then cast the data. In Python:
from pyspark.sql.functions import col
from pyspark.sql import SparkSession

def generate_series(start, stop, interval):
    """
    :param start - lower bound, inclusive
    :param stop - upper bound, exclusive
    :param interval - increment interval in seconds
    """
    spark = SparkSession.builder.getOrCreate()
    # Determine start and stop in epoch seconds
    start, stop = spark.createDataFrame(
        [(start, stop)], ("start", "stop")
    ).select(
        [col(c).cast("timestamp").cast("long") for c in ("start", "stop")]
    ).first()
    # Create range with increments and cast to timestamp
    return spark.range(start, stop, interval).select(
        col("id").cast("timestamp").alias("value")
    )
Example usage:
generate_series("2000-01-01", "2000-01-05", 60 * 60).show(5) # By hour
+-------------------+
| value|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-01 01:00:00|
|2000-01-01 02:00:00|
|2000-01-01 03:00:00|
|2000-01-01 04:00:00|
+-------------------+
only showing top 5 rows
generate_series("2000-01-01", "2000-01-05", 60 * 60 * 24).show() # By day
+-------------------+
| value|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-02 00:00:00|
|2000-01-03 00:00:00|
|2000-01-04 00:00:00|
+-------------------+
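If the goal is the question's inclusive, date-typed series, one option (my sketch, not part of the original answer) is to push the stop one interval past the last wanted day and cast the result:

from pyspark.sql.functions import col

# Sketch: daily DateType series that still includes 2000-01-05
(
    generate_series("2000-01-01", "2000-01-06", 60 * 60 * 24)
    .select(col("value").cast("date").alias("date"))
    .show()
)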
Answer 3 (score: 0)
Suppose you have a DataFrame df from Spark SQL; then try this:
import datetime

import pyspark.sql.functions as F
import pyspark.sql.types as T

# Build a list of consecutive dates: start, start + 1 day, ..., `total` dates in all
def timeseriesDF(start, total):
    return [start + datetime.timedelta(days=i) for i in range(total)]

# Wrap it as a UDF returning an array of dates
timeseries_udf = F.udf(timeseriesDF, T.ArrayType(T.DateType()))

# datediff(stop, start) + 1 gives the inclusive number of days between the two columns
df.withColumn(
    "t_series",
    timeseries_udf(df.start, F.datediff(df.stop, df.start) + 1)
).select(F.explode("t_series")).show()
Answer 4 (score: 0)
Building on user10938362's answer, this just shows a way to use range without a UDF, provided you are trying to build a DataFrame of dates from some already-ingested dataset rather than from a hard-coded start/stop.
# start date is min date
date_min=int(df.agg({'date': 'min'}).first()[0])
# end date is current date or alternatively could use max as above
date_max=(
    spark.sql('select unix_timestamp(current_timestamp()) as date_max')
    .collect()[0]['date_max']
)
# range is int, unix time is s so 60*60*24=day
df=spark.range(date_min, date_max, 60*60*24).select('id')
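The resulting id column holds epoch seconds; if an actual timestamp or date column is needed, a final cast (again my sketch, not part of the original answer) completes it:

from pyspark.sql.functions import col

# Sketch: turn the epoch-second ids into timestamp and date columns
df = df.select(
    col('id').cast('timestamp').alias('ts'),
    col('id').cast('timestamp').cast('date').alias('date'),
)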