Question

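Suppose I have a DataFrame of events with the time difference between each row. The main rule is that an event counts toward a visit only if it falls within 5 minutes of the previous or the next event: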

+--------+-------------------+--------+
|userid  |eventtime          |timeDiff|
+--------+-------------------+--------+
|37397e29|2017-06-04 03:00:00|60      |
|37397e29|2017-06-04 03:01:00|60      |
|37397e29|2017-06-04 03:02:00|60      |
|37397e29|2017-06-04 03:03:00|180     |
|37397e29|2017-06-04 03:06:00|60      |
|37397e29|2017-06-04 03:07:00|420     |
|37397e29|2017-06-04 03:14:00|60      |
|37397e29|2017-06-04 03:15:00|1140    |
|37397e29|2017-06-04 03:34:00|540     |
|37397e29|2017-06-04 03:53:00|540     |
+--------+-------------------+--------+

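The challenge is to group the events into a start_time and an end_time, where end_time is the latest eventtime that still satisfies the 5-minute condition. The output should look like this table: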

+--------+-------------------+--------------------+-----------+
|userid  |start_time         |end_time            |events     |
+--------+-------------------+--------------------+-----------+
|37397e29|2017-06-04 03:00:00|2017-06-04 03:07:00 |6          |
|37397e29|2017-06-04 03:14:00|2017-06-04 03:15:00 |2          |
+--------+-------------------+--------------------+-----------+

So far I have used window lag functions and some conditions, but I don't know where to go from here:

%spark.pyspark
from pyspark.sql import functions as F
from pyspark.sql import Window as W
from pyspark.sql.functions import col

windowSpec = W.partitionBy(result_poi["userid"], result_poi["unique_reference_number"]).orderBy(result_poi["eventtime"])
windowSpecDesc = W.partitionBy(result_poi["userid"], result_poi["unique_reference_number"]).orderBy(result_poi["eventtime"].desc())

# Window between the current row and the following row, e.g. 3:00pm and 3:03pm
nextEventTime = F.lead(col("eventtime"), 1).over(windowSpec)

# Window between the current row and the previous row, e.g. 3:03pm and 3:00pm
previousEventTime = F.lag(col("eventtime"), 1).over(windowSpec)
diffEventTime = nextEventTime - col("eventtime")

nextTimeDiff = F.coalesce((F.unix_timestamp(nextEventTime)
            - F.unix_timestamp('eventtime')), F.lit(0))
previousTimeDiff = F.coalesce((F.unix_timestamp('eventtime') -F.unix_timestamp(previousEventTime)), F.lit(0))


# Check if the next POI is equal to the current POI and has a time difference of less than 5 minutes.
validation = F.coalesce(( (nextTimeDiff < 300) | (previousTimeDiff < 300) ), F.lit(False))

# Change True to 1
visitCheck = F.coalesce((validation == True).cast("int"), F.lit(1))


result_poi.withColumn("visit_check", visitCheck).withColumn("nextTimeDiff", nextTimeDiff).select("userid", "eventtime", "nextTimeDiff", "visit_check").orderBy("eventtime")

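My question: is this a viable approach, and if so, how can I "move forward" and look at the maximum eventtime that satisfies the 5-minute condition? As far as I know, that would mean iterating over the values of a Spark SQL column. Is that possible? Wouldn't it be too expensive? Is there another way to achieve this result?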

Result of the solution suggested by @Aku:

+--------+--------+---------------------+---------------------+------+
|userid  |subgroup|start_time           |end_time             |events|
+--------+--------+---------------------+---------------------+------+
|37397e29|0       |2017-06-04 03:00:00.0|2017-06-04 03:06:00.0|5     |
|37397e29|1       |2017-06-04 03:07:00.0|2017-06-04 03:14:00.0|2     |
|37397e29|2       |2017-06-04 03:15:00.0|2017-06-04 03:15:00.0|1     |
|37397e29|3       |2017-06-04 03:34:00.0|2017-06-04 03:43:00.0|2     |
+--------+--------+---------------------+---------------------+------+

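It does not give the expected result. 3:07 - 3:14 and 03:34 - 03:43 are counted as ranges within 5 minutes, which they should not be. Also, 3:07 should be the end_time of the first row, because it is within 5 minutes of the previous row, 3:06.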

Answer 1

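You need one extra window function and a groupBy to achieve this. What we want is for every row with timeDiff greater than 300 to be the end of a group and the start of a new one. Aku's solution should work, except that its indicator marks the beginning of a group rather than the end. To change this, take the cumulative sum up to row n-1 instead of row n (n being the current row):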

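from pyspark.sql import Window
import pyspark.sql.functions as func

w = Window.partitionBy("userid").orderBy("eventtime")
# 1 on every row that is followed by a gap of more than 300 seconds
DF = DF.withColumn("indicator", (DF.timeDiff > 300).cast("int"))
# Running sum up to the previous row: the inclusive sum minus the row's own indicator
DF = DF.withColumn("subgroup", func.sum("indicator").over(w) - func.col("indicator"))
# With more than one user, group by ("userid", "subgroup") instead
DF = DF.groupBy("subgroup").agg(
    func.min("eventtime").alias("start_time"),
    func.max("eventtime").alias("end_time"),
    func.count("*").alias("events")
)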

+--------+-------------------+-------------------+------+
|subgroup|         start_time|           end_time|events|
+--------+-------------------+-------------------+------+
|       0|2017-06-04 03:00:00|2017-06-04 03:07:00|     6|
|       1|2017-06-04 03:14:00|2017-06-04 03:15:00|     2|
|       2|2017-06-04 03:34:00|2017-06-04 03:34:00|     1|
|       3|2017-06-04 03:53:00|2017-06-04 03:53:00|     1|
+--------+-------------------+-------------------+------+
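
As a sanity check on the table above, here is a minimal plain-Python trace of the same arithmetic (it is not Spark code). The list `rows` copies the sample data from the question and assumes timeDiff is the number of seconds until the next event. The names `rows`, `running` and `summary` are made up for this illustration.

```python
from itertools import accumulate

# Sample rows from the question: (eventtime, timeDiff in seconds)
rows = [
    ("2017-06-04 03:00:00", 60),
    ("2017-06-04 03:01:00", 60),
    ("2017-06-04 03:02:00", 60),
    ("2017-06-04 03:03:00", 180),
    ("2017-06-04 03:06:00", 60),
    ("2017-06-04 03:07:00", 420),
    ("2017-06-04 03:14:00", 60),
    ("2017-06-04 03:15:00", 1140),
    ("2017-06-04 03:34:00", 540),
    ("2017-06-04 03:53:00", 540),
]

# 1 where the gap after the row exceeds 5 minutes
indicator = [int(diff > 300) for _, diff in rows]

# Inclusive running sum, then subtract the row's own indicator
# so that the sum only covers rows 1..n-1
running = list(accumulate(indicator))
subgroup = [r - i for r, i in zip(running, indicator)]

# Aggregate per subgroup: (start_time, end_time, events)
summary = {}
for (ts, _), g in zip(rows, subgroup):
    start, end, n = summary.get(g, (ts, ts, 0))
    summary[g] = (min(start, ts), max(end, ts), n + 1)

for g, (start, end, n) in sorted(summary.items()):
    print(g, start, end, n)
```

The timestamps are compared as strings, which works here only because they all share the same fixed-width format.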

You also seem to filter out the rows with only one event, hence:

DF = DF.filter("events != 1")

+--------+-------------------+-------------------+------+
|subgroup|         start_time|           end_time|events|
+--------+-------------------+-------------------+------+
|       0|2017-06-04 03:00:00|2017-06-04 03:07:00|     6|
|       1|2017-06-04 03:14:00|2017-06-04 03:15:00|     2|
+--------+-------------------+-------------------+------+

Answer 2

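So if I understand this correctly, you essentially want to end each group where timeDiff > 300? That seems relatively easy with rolling window functions:

First some imports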

from pyspark.sql.window import Window
import pyspark.sql.functions as func

Then set up the window. I assume you would partition by userid

w = Window.partitionBy("userid").orderBy("eventtime")

Then figure out which subgroup each observation belongs to, by first marking the first member of each group and then summing that column.

indicator = (func.col("timeDiff") > 300).cast("integer")
subgroup = func.sum(indicator).over(w).alias("subgroup")

Then add some aggregate functions and you should be done

DF = DF.select("*", subgroup)\
.groupBy("subgroup")\
.agg(
    func.min("eventtime").alias("start_time"),
    func.max("eventtime").alias("end_time"),
    func.count(func.lit(1)).alias("events")
)
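
To see where each flagged row lands with the running sum above, here is a small plain-Python trace (not Spark code) on the sample rows from the question. It assumes timeDiff is the gap in seconds to the next event, as the sample table suggests. The date part of each timestamp is dropped for brevity, and `events`, `inclusive` and `members` are names made up for this sketch.

```python
from itertools import accumulate

# (time of day, timeDiff in seconds) for the sample rows in the question
events = [
    ("03:00", 60), ("03:01", 60), ("03:02", 60), ("03:03", 180), ("03:06", 60),
    ("03:07", 420), ("03:14", 60), ("03:15", 1140), ("03:34", 540), ("03:53", 540),
]

flags = [int(diff > 300) for _, diff in events]

# Running sum that includes the current row, like sum(indicator).over(w)
# with an ordered window
inclusive = list(accumulate(flags))

# Collect the events that end up in each subgroup
members = {}
for (t, _), g in zip(events, inclusive):
    members.setdefault(g, []).append(t)

for g, times in sorted(members.items()):
    print(g, times[0], times[-1], len(times))
```

In this trace the 03:07 row, whose timeDiff is 420, opens subgroup 1 instead of closing subgroup 0, because the running sum already includes its own flag.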

Answer 3

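One approach is to group the DataFrame according to your timeline criterion.

You can create a DataFrame from the rows that break the 5-minute timeline. These rows are the criteria for grouping the records, and they set the start time and end time of each group.

Then find the count and the max timestamp (end time) of each group.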
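
One possible reading of this idea, sketched in plain Python rather than Spark: collect the rows whose timeDiff exceeds 300 seconds as boundaries, then assign every event to the first boundary at or after its own timestamp. This is an interpretation, not the answerer's code. It assumes timeDiff is the gap to the next event and that the events are already sorted by time. The names `events`, `boundaries` and `groups` are hypothetical.

```python
from bisect import bisect_left

# (time of day, timeDiff in seconds) for the sample rows in the question
events = [
    ("03:00", 60), ("03:01", 60), ("03:02", 60), ("03:03", 180), ("03:06", 60),
    ("03:07", 420), ("03:14", 60), ("03:15", 1140), ("03:34", 540), ("03:53", 540),
]

# Rows that break the 5-minute timeline; each one closes a group
boundaries = [t for t, diff in events if diff > 300]

# An event belongs to the first boundary at or after its own time.
# Events later than the last boundary would fall into one trailing group.
groups = {}
for t, _ in events:
    idx = bisect_left(boundaries, t)
    groups.setdefault(idx, []).append(t)

# (start_time, end_time, events) per group
result = [(min(m), max(m), len(m)) for _, m in sorted(groups.items())]
for row in result:
    print(row)
```

In Spark the same assignment would need a join or a window function instead of `bisect_left`; the sketch only illustrates the grouping rule.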

With the condition
