Question

I have a database of visit times stored as Unix timestamps. I use Spark SQL, and for each ID I need the longest sequence of consecutive dates. The data looks like this:

ID, time
1, 1493596800
1, 1493596900
1, 1493432800
2, 1493596800
2, 1493596850
2, 1493432800

I tried to adapt the answer to "Detect consecutive dates ranges using SQL" to my case, but I could not get the result I expect, which looks like this:

ID, longest_seq (days)
1, 2
2, 5
3, 1

If anyone has a clue about how to fix this query, or can see what is wrong with it, I would appreciate the help. Thanks.

[Edit] More explicit input/output. This is the query I tried:

 SELECT ID, MIN (d), MAX(d)
    FROM (
      SELECT ID, cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date) AS d, 
                ROW_NUMBER() OVER(
         PARTITION BY ID ORDER BY cast(from_utc_timestamp(cast(time as timestamp), 'CEST') 
                                                           as date)) rn
      FROM purchase
      where ID is not null
      GROUP BY ID, cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date) 
    )
    GROUP BY ID, rn
    ORDER BY ID

Simplified to day numbers, the input would be:

ID, time
1, 1
1, 2
1, 3
2, 1
2, 3
2, 4
2, 5
2, 10
2, 11
3, 1
3, 4
3, 9
3, 11

All visits are stored as timestamps, but I need consecutive days, so several visits on the same day count only once.

Answer 1

My answer below is adapted from https://dzone.com/articles/how-to-find-the-longest-consecutive-series-of-even for Spark SQL. Wrap each SQL query like this:

spark.sql("""
SQL_QUERY
""")

So, for the first query:

CREATE TABLE intermediate_1 AS
SELECT 
    id,
    time,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS rn,
    time - ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS grp
FROM purchase

This gives you:

id, time, rn, grp
1,  1,    1,  0
1,  2,    2,  0
1,  3,    3,  0
2,  1,    1,  0
2,  3,    2,  1
2,  4,    3,  1
2,  5,    4,  1
2,  10,   5,  5
2,  11,   6,  5
3,  1,    1,  0
3,  4,    2,  2
3,  9,    3,  6
3,  11,   4,  7

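We can see that rows with consecutive times share the same grp value. Next, we use GROUP BY and COUNT to get the number of consecutive times in each group.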

CREATE TABLE intermediate_2 AS
SELECT 
    id,
    grp,
    COUNT(*) AS num_consecutive
FROM intermediate_1
GROUP BY id, grp

This returns:

id, grp, num_consecutive
1,  0,   3
2,  0,   1
2,  1,   3
2,  5,   2
3,  0,   1
3,  2,   1
3,  6,   1
3,  7,   1

Now we just use MAX and GROUP BY to get the maximum number of consecutive times per id.

CREATE TABLE final AS
SELECT 
    id,
    MAX(num_consecutive) as max_consecutive
FROM intermediate_2
GROUP BY id

Which gives you:

id, max_consecutive
1,  3
2,  3
3,  1
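
The three steps above can also be traced outside Spark. The following is a minimal plain-Python sketch, not Spark code. The function name `longest_run_per_id` is hypothetical, and the sketch assumes, as the tables above do, that `time` is an integer day number and that each (id, time) pair appears at most once.

```python
from collections import Counter, defaultdict


def longest_run_per_id(rows):
    """Return {id: length of the longest run of consecutive time values}.

    Assumes integer day numbers and no duplicate (id, time) pairs.
    """
    times_by_id = defaultdict(list)
    for id_, time in rows:
        times_by_id[id_].append(time)

    result = {}
    for id_, times in times_by_id.items():
        # Step 1: rn is the 1-based position in time order; grp = time - rn
        # stays constant while the times increase by exactly 1.
        groups = [time - rn for rn, time in enumerate(sorted(times), start=1)]
        # Step 2: count the rows in each grp (GROUP BY id, grp with COUNT).
        run_lengths = Counter(groups)
        # Step 3: keep the largest count (GROUP BY id with MAX).
        result[id_] = max(run_lengths.values())
    return result


purchase = [(1, 1), (1, 2), (1, 3),
            (2, 1), (2, 3), (2, 4), (2, 5), (2, 10), (2, 11),
            (3, 1), (3, 4), (3, 9), (3, 11)]
print(longest_run_per_id(purchase))  # {1: 3, 2: 3, 3: 1}
```

If the same day can appear more than once for an id, the row numbers drift away from the day numbers and the grp trick breaks, so the rows would need de-duplicating first.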

Hope this helps!

Answer 2

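This is a case for my beloved window aggregate functions!

I think the following example could help you (at least to get started).

Here is the dataset I use. I translated your time (stored as longs) into a numeric time that denotes the day, to avoid messing around with timestamps in Spark SQL, which could make the solution harder to follow (possibly).

In the visits dataset below, the time column holds a day number, so values that differ by 1 represent consecutive days.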

scala> visits.show
+---+----+
| ID|time|
+---+----+
|  1|   1|
|  1|   1|
|  1|   2|
|  1|   3|
|  1|   3|
|  1|   3|
|  2|   1|
|  3|   1|
|  3|   2|
|  3|   2|
+---+----+
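
For reference, day numbers of this kind could be derived from the raw Unix timestamps in the question roughly as follows. This is a plain-Python sketch rather than Spark code. The helper name `to_day_numbers` is hypothetical, and it assumes days are counted in UTC (the question's query converts to a different time zone). It also collapses repeated visits on the same day, which the dataset above does not do, because the question asks for each day to count once.

```python
SECONDS_PER_DAY = 86_400


def to_day_numbers(rows):
    """Map (id, unix_seconds) rows to sorted, de-duplicated (id, day) rows.

    The day is the number of whole days since 1970-01-01 in UTC (an assumption).
    """
    return sorted({(id_, ts // SECONDS_PER_DAY) for id_, ts in rows})


raw = [(1, 1493596800), (1, 1493596900), (1, 1493432800),
       (2, 1493596800), (2, 1493596850), (2, 1493432800)]
print(to_day_numbers(raw))
# [(1, 17285), (1, 17287), (2, 17285), (2, 17287)]
```

In this sample both ids visit on days 17285 and 17287, which are two days apart, so neither id has a run longer than one day.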

Let's define a window specification that groups the rows by id and sorts them by time.

import org.apache.spark.sql.expressions.Window
val idsSortedByTime = Window.
  partitionBy("id").
  orderBy("time")

With that, you rank the rows and count the rows that share the same rank.

// Rank each row within its id by time, then count the rows that share a rank.
val answer = visits.
  select($"id", $"time", rank over idsSortedByTime as "rank").
  groupBy("id", "time", "rank").
  agg(count("*") as "count")

scala> answer.show
+---+----+----+-----+
| id|time|rank|count|
+---+----+----+-----+
|  1|   1|   1|    2|
|  1|   2|   3|    1|
|  1|   3|   4|    3|
|  3|   1|   1|    1|
|  3|   2|   2|    2|
|  2|   1|   1|    1|
+---+----+----+-----+

That appears to be (very close to?) a solution. You seem to be almost done!
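
One way the remaining step might look, sketched in plain Python rather than with Spark window functions: collapse repeated visits into distinct days, then walk the sorted days and track the current run. The function name `longest_streaks` is hypothetical, and this is a swapped-in illustration of the idea, not a continuation of the rank-based query above.

```python
def longest_streaks(rows):
    """Return {id: longest run of consecutive days}, counting each day once."""
    days_by_id = {}
    for id_, day in rows:
        days_by_id.setdefault(id_, set()).add(day)  # duplicates collapse here

    streaks = {}
    for id_, days in days_by_id.items():
        best = current = 0
        previous = None
        for day in sorted(days):
            # Extend the run if this day directly follows the previous one,
            # otherwise start a new run of length 1.
            if previous is not None and day == previous + 1:
                current += 1
            else:
                current = 1
            best = max(best, current)
            previous = day
        streaks[id_] = best
    return streaks


visits = [(1, 1), (1, 1), (1, 2), (1, 3), (1, 3), (1, 3),
          (2, 1), (3, 1), (3, 2), (3, 2)]
print(longest_streaks(visits))  # {1: 3, 2: 1, 3: 2}
```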

Answer 3

Using spark.sql with intermediate tables:

scala> val df = Seq((1, 1),(1, 2),(1, 3),(2, 1),(2, 3),(2, 4),(2, 5),(2, 10),(2, 11),(3, 1),(3, 4),(3, 9),(3, 11)).toDF("id","time")
df: org.apache.spark.sql.DataFrame = [id: int, time: int]

scala> df.createOrReplaceTempView("tb1")

scala> spark.sql("""
  with tb2 as (
    select id, time, time - row_number() over (partition by id order by time) rw1
    from tb1
  ),
  tb3 as (
    select id, count(rw1) rw2
    from tb2
    group by id, rw1
  )
  select id, rw2
  from tb3
  where (id, rw2) in (select id, max(rw2) from tb3 group by id)
  group by id, rw2
""").show(false)
+---+---+
|id |rw2|
+---+---+
|1  |3  |
|3  |1  |
|2  |3  |
+---+---+
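
The same query shape can be tried without a Spark session. The sketch below runs an equivalent query on SQLite through Python's built-in `sqlite3` module. SQLite is not Spark SQL, so this only illustrates the technique. It assumes a SQLite build with window-function support (version 3.25 or later), and the column aliases `grp`, `run_length`, and `longest_run` are names chosen here for readability.

```python
import sqlite3

rows = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 3), (2, 4), (2, 5),
        (2, 10), (2, 11), (3, 1), (3, 4), (3, 9), (3, 11)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb1 (id INTEGER, time INTEGER)")
conn.executemany("INSERT INTO tb1 VALUES (?, ?)", rows)

query = """
WITH tb2 AS (
    -- time minus row number is constant within a run of consecutive days
    SELECT id, time - ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS grp
    FROM tb1
),
tb3 AS (
    -- one row per run, with its length
    SELECT id, COUNT(*) AS run_length
    FROM tb2
    GROUP BY id, grp
)
SELECT id, MAX(run_length) AS longest_run
FROM tb3
GROUP BY id
ORDER BY id
"""
result = conn.execute(query).fetchall()
conn.close()
print(result)  # [(1, 3), (2, 3), (3, 1)]
```

Taking MAX per id in the final SELECT replaces the `(id, rw2) in (...)` filter and returns one row per id even when two runs tie for the longest.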



How do I find the longest sequence of consecutive dates?
