I have a dataframe that looks like this:
from pyspark.sql import functions as fn
from pyspark.sql.types import StructType, StructField, StringType, LongType
from pyspark.sql import SQLContext
sqlc = SQLContext(sc)
schema = StructType([
    StructField('uuid', StringType(), False),
    StructField('key', StructType([
        StructField('event_name', StringType()),
        StructField('event_type', StringType())])),
    StructField('timestamp', LongType(), False),
])
df_test = sqlc.createDataFrame([
    ["1", ['name_1', 'type_1'], 1532428811],
    ["1", ['name_2', 'type_2'], 1532428812],
    ["1", ['name_2', 'type_1'], 1532428813],
    ["2", ['name_3', 'type_3'], 1532428814],
    ["2", ['name_4', 'type_3'], 1532428815],
    ["2", ['name_5', 'type_3'], 1532428880]],
    schema=schema)
+----+----------------+----------+
|uuid| key| timestamp|
+----+----------------+----------+
| 1|[name_1, type_1]|1532428811|
| 1|[name_2, type_2]|1532428812|
| 1|[name_2, type_1]|1532428813|
| 2|[name_3, type_3]|1532428814|
| 2|[name_4, type_3]|1532428815|
| 2|[name_5, type_3]|1532428880|
+----+----------------+----------+
I want to group the data so that, for each uuid, rows whose timestamps differ by less than a cutoff threshold and whose event_types are the same are collapsed into a single row.
In addition, I want to keep start and end as the minimum and maximum timestamps of the rows that satisfy the condition.
Assuming a cutoff of 50 (note that the last row below ends up on its own because its timestamp differs from the previous one by more than 50), the output should look like this:
+----+-----------------------------------------------------------+-----------+-----------+
|uuid| key | start | end |
+----+-----------------------------------------------------------+-----------+-----------+
| 1|[[name_1|type_1, 1532428811], [name_2|type_1, 1532428813]] |1532428811 | 1532428813|
| 1|[[name_2|type_2, 1532428812]] |1532428812 | 1532428812|
| 2|[[name_3|type_3, 1532428814], [name_4|type_3, 1532428815]] |1532428814 | 1532428815|
| 2|[[name_5|type_3, 1532428880]] |1532428880 | 1532428880|
+----+-----------------------------------------------------------+-----------+-----------+
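
One way I can think of (a rough sketch rather than a definitive answer, and assuming that "difference between any timestamps" means the gap between consecutive timestamps within the same uuid/event_type): flag the rows that start a new group with a window function, turn the flags into a group id with a running sum, then aggregate each group. The session column name and the name|type formatting are only illustrative.

from pyspark.sql import Window, functions as fn

cutoff = 50  # threshold from the example above (assumption)

# Order events within each (uuid, event_type) partition and flag a row
# whenever its gap to the previous timestamp exceeds the cutoff.
w = Window.partitionBy('uuid', fn.col('key.event_type')).orderBy('timestamp')

df_flagged = df_test.withColumn(
    'new_session',
    fn.when(fn.col('timestamp') - fn.lag('timestamp').over(w) > cutoff, 1)
      .otherwise(0))

# A running sum of the flags gives a session id per (uuid, event_type).
df_sessions = df_flagged.withColumn('session', fn.sum('new_session').over(w))

# Collapse each session into one row with the min/max timestamps.
result = (df_sessions
          .groupBy('uuid', fn.col('key.event_type').alias('event_type'), 'session')
          .agg(fn.collect_list(
                   fn.struct(
                       fn.concat_ws('|', 'key.event_name', 'key.event_type')
                         .alias('name_type'),
                       'timestamp')).alias('key'),
               fn.min('timestamp').alias('start'),
               fn.max('timestamp').alias('end'))
          .drop('event_type', 'session')
          .orderBy('uuid', 'start'))
result.show(truncate=False)

Note that collect_list does not guarantee element order, so the arrays in key may come back unsorted; they would need to be sorted afterwards if the order inside key matters.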