I'm curious whether there is a way to use lead/lag to compute something like the following.
Step 1: I have a DataFrame
+----+-----------+------+
| id | timestamp | sess |
+----+-----------+------+
| xx | 1 | A |
+----+-----------+------+
| yy | 2 | A |
+----+-----------+------+
| zz | 1 | B |
+----+-----------+------+
| yy | 3 | B |
+----+-----------+------+
| tt | 4 | B |
+----+-----------+------+
I want to collect, for each id, the ids that come directly before it, partitioned by session_id
+----+---------+
| id | id_list |
+----+---------+
| yy | [xx,zz] |
+----+---------+
| xx | [] |
+----+---------+
| zz | [] |
+----+---------+
| tt | [yy] |
+----+---------+
Answer 0 (score: 1)
You can create a window over the sess column (ordered by timestamp) and apply lag to the id column, as you mentioned in the question. Then you can use groupBy together with the aggregate function collect_list to get the output.
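The steps in this answer can be sketched in plain Python to make the logic concrete. This is only an emulation under stated assumptions, not Spark code: the helper name previous_ids is hypothetical, the window is assumed to be ordered by timestamp, and missing lag values are dropped when collecting (which is how collect_list treats nulls). In PySpark the corresponding pieces would be Window.partitionBy("sess").orderBy("timestamp"), lag("id").over(...), and groupBy("id").agg(collect_list(...)).

```python
from collections import defaultdict

# Rows from the question: (id, timestamp, sess)
rows = [
    ("xx", 1, "A"),
    ("yy", 2, "A"),
    ("zz", 1, "B"),
    ("yy", 3, "B"),
    ("tt", 4, "B"),
]


def previous_ids(rows):
    """Hypothetical helper: for each id, collect the id that directly
    precedes it in every session it appears in."""
    # Step 1 ("window"): partition the rows by sess.
    sessions = defaultdict(list)
    for id_, ts, sess in rows:
        sessions[sess].append((ts, id_))

    # Every id gets an entry, even if nothing ever precedes it.
    result = {id_: [] for id_, _, _ in rows}

    for sess in sorted(sessions):
        prev = None
        # Order each partition by timestamp (assumed window ordering).
        for _, id_ in sorted(sessions[sess]):
            # Step 2 ("lag"): prev is the id of the previous row in the session.
            # Step 3 ("groupBy" + "collect_list"): gather non-missing lag values per id.
            if prev is not None:
                result[id_].append(prev)
            prev = id_
    return result


id_lists = previous_ids(rows)
for id_, id_list in id_lists.items():
    print(id_, id_list)
```

On the sample data this gives [xx, zz] for yy, [yy] for tt, and empty lists for xx and zz, matching the table in the question. Note that in Spark the order of elements inside a collected list is not guaranteed after a groupBy, whereas this sketch processes sessions in sorted order.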