I am currently trying to extract series of consecutive event occurrences in a PySpark dataframe and order/rank them as shown below (for convenience, I have ordered the initial dataframe by user_id and timestamp):

df_ini:
+-------+--------------------+------------+
|user_id| timestamp | actions |
+-------+--------------------+------------+
| 217498| 100000001| 'A' |
| 217498| 100000025| 'A' |
| 217498| 100000124| 'A' |
| 217498| 100000152| 'B' |
| 217498| 100000165| 'C' |
| 217498| 100000177| 'C' |
| 217498| 100000182| 'A' |
| 217498| 100000197| 'B' |
| 217498| 100000210| 'B' |
| 854123| 100000005| 'A' |
| 854123| 100000007| 'A' |
| etc.
I would like to transform this into an expected df_transformed, where consecutive identical actions are grouped per user, with a count of occurrences and an order/rank for each run.
My guess is that I have to use a smart window function that partitions the table by user_id and actions, but only when those actions are consecutive in time! I have no idea how to do that...

If anyone has come across this type of transformation in PySpark, I would be glad to get a hint!

Cheers
Answer 0 (score: 6)
This is a fairly common pattern which can be expressed using window functions in a few steps. First, import the required functions:
from pyspark.sql.functions import sum as sum_, lag, col, coalesce, lit
from pyspark.sql.window import Window
Next, define a window:
w = Window.partitionBy("user_id").orderBy("timestamp")
Flag the first row of each run (a row starts a new run if its action differs from the previous row's action; coalesce handles the first row of each user, where lag returns null):
is_first = coalesce(
    (lag("actions", 1).over(w) != col("actions")).cast("bigint"),
    lit(1)
)
Define order as a running sum of that flag:
order = sum_("is_first").over(w)
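The mechanics of the flag-plus-running-sum trick can be sketched in plain Python (no Spark required), using the action sequence of user 217498 from the question:

```python
# Plain-Python illustration of the window logic above:
# a lag-style "is_first" flag, then a running sum that labels each run.
actions = ['A', 'A', 'A', 'B', 'C', 'C', 'A', 'B', 'B']

# 1 where the action differs from the previous one; the first row is always 1.
is_first = [1] + [int(cur != prev) for prev, cur in zip(actions, actions[1:])]

# Cumulative sum of the flags gives each run a distinct, increasing id.
order = []
running = 0
for flag in is_first:
    running += flag
    order.append(running)

print(is_first)  # [1, 0, 0, 1, 1, 0, 1, 1, 0]
print(order)     # [1, 1, 1, 2, 3, 3, 4, 5, 5]
```

Grouping by this run id (together with user_id and actions) then counts each consecutive run separately, which is exactly what the aggregation below does.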
Combine all the pieces with an aggregation:
(df
    .withColumn("is_first", is_first)
    .withColumn("order", order)
    .groupBy("user_id", "actions", "order")
    .count())
If you define df as:
df = sc.parallelize([
    (217498, 100000001, 'A'), (217498, 100000025, 'A'), (217498, 100000124, 'A'),
    (217498, 100000152, 'B'), (217498, 100000165, 'C'), (217498, 100000177, 'C'),
    (217498, 100000182, 'A'), (217498, 100000197, 'B'), (217498, 100000210, 'B'),
    (854123, 100000005, 'A'), (854123, 100000007, 'A')
]).toDF(["user_id", "timestamp", "actions"])
and sort the result by user_id and order (e.g. with .orderBy("user_id", "order")), you get:
+-------+-------+-----+-----+
|user_id|actions|order|count|
+-------+-------+-----+-----+
| 217498| A| 1| 3|
| 217498| B| 2| 1|
| 217498| C| 3| 2|
| 217498| A| 4| 1|
| 217498| B| 5| 2|
| 854123| A| 1| 2|
+-------+-------+-----+-----+
Answer 1 (score: 2)
I am afraid this is not possible using standard dataframe window functions. But you can still use the older RDD API's groupByKey() to achieve the transformation:
>>> from itertools import groupby
>>>
>>> def recalculate(records):
... actions = [r.actions for r in sorted(records[1], key=lambda r: r.timestamp)]
... groups = [list(g) for k, g in groupby(actions)]
... return [(records[0], g[0], len(g), i+1) for i, g in enumerate(groups)]
...
>>> df_ini.rdd.map(lambda row: (row.user_id, row)) \
... .groupByKey().flatMap(recalculate) \
... .toDF(['user_id', 'actions', 'nf_of_occ', 'order']).show()
+-------+-------+---------+-----+
|user_id|actions|nf_of_occ|order|
+-------+-------+---------+-----+
| 217498| A| 3| 1|
| 217498| B| 1| 2|
| 217498| C| 2| 3|
| 217498| A| 1| 4|
| 217498| B| 2| 5|
| 854123| A| 2| 1|
+-------+-------+---------+-----+
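The recalculate logic above can be exercised locally without a Spark cluster. In this sketch a namedtuple stands in for a pyspark Row (an assumption for local testing only), and the input tuple mimics one (key, rows) pair as produced by groupByKey():

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical stand-in for a pyspark Row, for local testing only.
Row = namedtuple("Row", ["user_id", "timestamp", "actions"])

rows = [
    Row(217498, 100000001, 'A'), Row(217498, 100000025, 'A'),
    Row(217498, 100000124, 'A'), Row(217498, 100000152, 'B'),
    Row(217498, 100000165, 'C'), Row(217498, 100000177, 'C'),
    Row(217498, 100000182, 'A'), Row(217498, 100000197, 'B'),
    Row(217498, 100000210, 'B'),
]

def recalculate(records):
    # records = (user_id, iterable of rows), as groupByKey() would emit.
    actions = [r.actions for r in sorted(records[1], key=lambda r: r.timestamp)]
    # Run-length encode the consecutive actions.
    groups = [list(g) for k, g in groupby(actions)]
    return [(records[0], g[0], len(g), i + 1) for i, g in enumerate(groups)]

result = recalculate((217498, rows))
print(result)
# [(217498, 'A', 3, 1), (217498, 'B', 1, 2), (217498, 'C', 2, 3),
#  (217498, 'A', 1, 4), (217498, 'B', 2, 5)]
```

Note that groupByKey() collects all rows of a user on one executor, so this approach can be memory-heavy for users with very many events, whereas the window-function answer stays within the dataframe engine.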