This question is a follow-up to this answer. Spark throws an error when doing the following:
# Group results in 12 second windows of "foo", then by integer buckets of 2 for "bar"
fooWindow = window(col("foo"), "12 seconds"))
# A sub bucket that contains values in [0,2), [2,4), [4,6]...
barWindow = window(col("bar").cast("timestamp"), "2 seconds").cast("struct<start:bigint,end:bigint>")
results = df.groupBy(fooWindow, barWindow).count()
The error is:
"Multiple time window expressions would result in a cartesian product of rows, therefore they are currently not supported."
Is there a way to achieve the desired behavior?
Answer 0 (score: 4)
I was able to come up with a solution by adapting this SO answer.
Note: this solution only works if window is called at most once, meaning multiple time windows are not allowed. A quick search of the Spark source on GitHub shows there is a hard limit of <= 1 window expressions per query.
By using withColumn to define the bucket for each row, we can then group by that new column directly:
from pyspark.sql import functions as F
from datetime import datetime as dt, timedelta as td
start = dt.now()
second = td(seconds=1)
data = [(start, 0), (start + second, 1), (start + 12 * second, 2)]
df = spark.createDataFrame(data, ('foo', 'bar'))
# Create a new column defining the window for each bar
df = df.withColumn("barWindow", F.col("bar") - (F.col("bar") % 2))
# Keep the time window as is
fooWindow = F.window(F.col("foo"), "12 seconds").start.alias("foo")
# Use the new column created
results = df.groupBy(fooWindow, F.col("barWindow")).count()
results.show()
# +-------------------+---------+-----+
# | foo|barWindow|count|
# +-------------------+---------+-----+
# |2019-01-24 14:12:48| 0| 2|
# |2019-01-24 14:13:00| 2| 1|
# +-------------------+---------+-----+
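If you also want the integer bucket to look like window()'s [start, end) struct rather than a bare bucket start, here is a minimal sketch along the same lines (the numeric_bucket helper below is hypothetical and not part of the original answer; it just builds the struct by hand):
from pyspark.sql import functions as F

# Hypothetical helper: bucket a numeric column into [start, start + size)
# and expose it as a struct, mirroring the shape of window()'s output.
def numeric_bucket(col_name, size):
    start = F.col(col_name) - (F.col(col_name) % size)
    return F.struct(start.alias("start"), (start + size).alias("end")).alias(col_name + "Window")

results = df.groupBy(
    F.window(F.col("foo"), "12 seconds").start.alias("foo"),
    numeric_bucket("bar", 2),
).count()
# Each group key now carries both bucket endpoints, e.g. barWindow = {0, 2}.
results.show(truncate=False)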