Spark window function rangeBetween produces incorrect results

Date: 2017-11-09 10:58:40

Tags: scala apache-spark window-functions

I am trying to run a window function with rangeBetween over a Spark DataFrame, on a column of type Long, and the window results are incorrect. Am I doing something wrong?

Here is my DataFrame:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rowsRdd: RDD[Row] = spark.sparkContext.parallelize(
      Seq(
        Row("2014-11-01 08:10:10.12345", 141482941012345L),
        Row("2014-11-01 09:10:10.12345", 141483301012345L),
        Row("2014-11-01 10:10:10.12345", 141483661012345L),
        Row("2014-11-02 10:10:10.12345", 141492301012345L),
        Row("2014-11-03 10:10:10.12345", 141500941012345L),
        Row("2014-11-04 10:10:10.12345", 141509581012345L),
        Row("2014-11-05 10:10:10.12345", 141518221012345L),
        Row("2014-11-06 10:10:10.12345", 141526861012345L),
        Row("2014-11-07 10:10:10.12345", 141535501012345L),
        Row("2014-11-08 10:10:10.12345", 141544141012345L)
      )
    )
val schema = new StructType()
  .add(StructField("dateTime", StringType, true))
  .add(StructField("unixTime", LongType, true))

val df = spark.createDataFrame(rowsRdd, schema)
df.show(10, false)
df.printSchema()

This gives:

+-------------------------+---------------+
|dateTime                 |unixTime       |
+-------------------------+---------------+
|2014-11-01 08:10:10.12345|141482941012345|
|2014-11-01 09:10:10.12345|141483301012345|
|2014-11-01 10:10:10.12345|141483661012345|
|2014-11-02 10:10:10.12345|141492301012345|
|2014-11-03 10:10:10.12345|141500941012345|
|2014-11-04 10:10:10.12345|141509581012345|
|2014-11-05 10:10:10.12345|141518221012345|
|2014-11-06 10:10:10.12345|141526861012345|
|2014-11-07 10:10:10.12345|141535501012345|
|2014-11-08 10:10:10.12345|141544141012345|
+-------------------------+---------------+

And the schema:

root
 |-- dateTime: string (nullable = true)
 |-- unixTime: long (nullable = true)

The first column is the event's timestamp (a string; we won't actually use it), and the second column is the corresponding unix time in units of 1e-5 seconds.
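
As a quick sanity check of that unit (a sketch, assuming the timestamps are UTC), dividing by 100000 recovers epoch seconds:

import java.time.Instant

// 141482941012345 in 1e-5 s units is 1414829410 epoch seconds,
// which matches the first row's timestamp (assuming UTC).
val epochSeconds = 141482941012345L / 100000L
println(Instant.ofEpochSecond(epochSeconds)) // 2014-11-01T08:10:10Z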

Now I want to count the events in a window ending at the current row. For example, with a 3-hour window:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{asc, col, count}

val hour: Long = 60*60*100000L
val w = Window.orderBy(col("unixTime")).rangeBetween(-3*hour, 0)
val df2 = df.withColumn("cts", count(col("dateTime")).over(w)).orderBy(asc("unixTime"))

This correctly returns (the first three events all fall within three hours of one another, hence the running counts 1, 2, 3):

+-------------------------+---------------+---+
|dateTime                 |unixTime       |cts|
+-------------------------+---------------+---+
|2014-11-01 08:10:10.12345|141482941012345|1  |
|2014-11-01 09:10:10.12345|141483301012345|2  |
|2014-11-01 10:10:10.12345|141483661012345|3  |
|2014-11-02 10:10:10.12345|141492301012345|1  |
|2014-11-03 10:10:10.12345|141500941012345|1  |
|2014-11-04 10:10:10.12345|141509581012345|1  |
|2014-11-05 10:10:10.12345|141518221012345|1  |
|2014-11-06 10:10:10.12345|141526861012345|1  |
|2014-11-07 10:10:10.12345|141535501012345|1  |
|2014-11-08 10:10:10.12345|141544141012345|1  |
+-------------------------+---------------+---+

Below is the result with a 6-hour window. Why are the counts all 0?

val hour: Long = 60*60*100000L
val w = Window.orderBy(col("unixTime")).rangeBetween(-6*hour, 0)
val df2 = df.withColumn("cts", count(col("dateTime")).over(w)).orderBy(asc("unixTime"))

+-------------------------+---------------+---+
|dateTime                 |unixTime       |cts|
+-------------------------+---------------+---+
|2014-11-01 08:10:10.12345|141482941012345|0  |
|2014-11-01 09:10:10.12345|141483301012345|0  |
|2014-11-01 10:10:10.12345|141483661012345|0  |
|2014-11-02 10:10:10.12345|141492301012345|0  |
|2014-11-03 10:10:10.12345|141500941012345|0  |
|2014-11-04 10:10:10.12345|141509581012345|0  |
|2014-11-05 10:10:10.12345|141518221012345|0  |
|2014-11-06 10:10:10.12345|141526861012345|0  |
|2014-11-07 10:10:10.12345|141535501012345|0  |
|2014-11-08 10:10:10.12345|141544141012345|0  |
+-------------------------+---------------+---+

And here is what happens with a 12-hour window. Why are the counts all 1 now?

val hour: Long = 60*60*100000L
val w = Window.orderBy(col("unixTime")).rangeBetween(-12*hour, 0)
val df2 = df.withColumn("cts", count(col("dateTime")).over(w)).orderBy(asc("unixTime"))

+-------------------------+---------------+---+
|dateTime                 |unixTime       |cts|
+-------------------------+---------------+---+
|2014-11-01 08:10:10.12345|141482941012345|1  |
|2014-11-01 09:10:10.12345|141483301012345|1  |
|2014-11-01 10:10:10.12345|141483661012345|1  |
|2014-11-02 10:10:10.12345|141492301012345|1  |
|2014-11-03 10:10:10.12345|141500941012345|1  |
|2014-11-04 10:10:10.12345|141509581012345|1  |
|2014-11-05 10:10:10.12345|141518221012345|1  |
|2014-11-06 10:10:10.12345|141526861012345|1  |
|2014-11-07 10:10:10.12345|141535501012345|1  |
|2014-11-08 10:10:10.12345|141544141012345|1  |
+-------------------------+---------------+---+

What is going on here? It does not work correctly for any large rangeBetween values.

Edit: 9/11/2017

Could this be related to this issue? [SPARK-19451][SQL] rangeBetween method should accept Long value as boundary #18540. Has it been implemented in the latest version of Spark?

1 Answer:

Answer 0 (score: 2):

It is indeed related to the linked issue. 6 * hour is 2160000000, which is larger than Integer.MAX_VALUE (2147483647), so it results in an integer overflow:

scala> (6 * hour).toInt
res4: Int = -2134967296
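
Assuming the negative boundaries are truncated to Int the same way, the two result sets above follow directly:

scala> (-6 * hour).toInt
res5: Int = 2134967296

scala> (-12 * hour).toInt
res6: Int = -25032704

With -6 * hour the truncated lower bound becomes a large positive value, so the frame begins after the current row and is empty (count 0). With -12 * hour it becomes -25032704, which in 1e-5 s units is only about 250 seconds, so each frame contains just the current row (count 1).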

This has been fixed on the current master and the fix will be released in Spark 2.3.
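
Until then, one possible workaround (a sketch, not from the original answer; unixTimeSec is a name introduced here) is to rescale the column to whole seconds so that the boundaries fit comfortably in an Int:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{asc, col, count}

// Convert 1e-5 s units to seconds; -6 * 3600 = -21600 is far below Integer.MAX_VALUE.
val dfSec = df.withColumn("unixTimeSec", (col("unixTime") / 100000L).cast("long"))
val wSec = Window.orderBy(col("unixTimeSec")).rangeBetween(-6 * 3600L, 0)
val dfCts = dfSec.withColumn("cts", count(col("dateTime")).over(wSec)).orderBy(asc("unixTime"))

This trades sub-second precision for window boundaries that stay within the Int range.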