Suppose I have a table with the following columns:
| user | event_start | event_end | show_start | show_end | show_name |
------------------------------------------------------------------------------------------------------------
| 286 | 2018-06-12 00:00:19 | 2018-06-12 00:00:48 | 2018-06-12 00:00:00 | 2018-06-12 01:00:00 | foo |
| 287 | 2018-06-12 00:00:45 | 2018-06-12 00:00:53 | 2018-06-12 00:00:00 | 2018-06-12 01:00:00 | foo |
| 288 | 2018-06-12 00:00:47 | 2018-06-12 00:00:58 | 2018-06-12 00:00:00 | 2018-06-12 02:00:00 | bar |
...
How can I add a new column that contains the number of distinct users in the table whose event_start value falls between this row's show_start and show_end?
The resulting table would look like this:
| user | event_start | event_end | show_start | show_end | show_name | active_users |
---------------------------------------------------------------------------------------------------------------------------
| 286 | 2018-06-12 00:00:19 | 2018-06-12 00:00:48 | 2018-06-12 00:00:00 | 2018-06-12 01:00:00 | foo | 18 |
| 287 | 2018-06-12 00:00:45 | 2018-06-12 00:00:53 | 2018-06-12 00:00:00 | 2018-06-12 01:00:00 | foo | 18 |
| 288 | 2018-06-12 00:00:47 | 2018-06-12 00:00:58 | 2018-06-12 00:00:00 | 2018-06-12 02:00:00 | bar | 31 |
...
This will be used to calculate the proportion of users watching each show relative to the number of active users.
I have a hunch that I may need a window function, but I haven't quite been able to work out how to construct the required operations.
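To make the end goal concrete, here is a rough sketch of the ratio I'd like to end up with once an active_users column exists (this is only my own illustration; dfWithActive, viewers and watch_ratio are placeholder names):

import org.apache.spark.sql.functions.countDistinct

// per show: distinct viewers divided by the active-user count described above
val ratios = dfWithActive.
  groupBy("show_name", "show_start", "show_end", "active_users").
  agg(countDistinct("user").as("viewers")).
  withColumn("watch_ratio", $"viewers" / $"active_users")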
Answer 0 (score: 2)
Based on the requirement clarified in the comments, it appears that for every distinct show a lookup over the whole DataFrame is needed to count the active users. This could be expensive, especially if there are many distinct shows. Assuming the number of distinct shows is not too large (i.e. small enough to be collected to the driver with collect), here's one approach:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import java.sql.Timestamp
val df = Seq(
  (286, Timestamp.valueOf("2018-06-12 00:00:19"), Timestamp.valueOf("2018-06-12 00:00:48"),
   Timestamp.valueOf("2018-06-12 00:00:00"), Timestamp.valueOf("2018-06-12 01:00:00"), "foo"),
  (287, Timestamp.valueOf("2018-06-12 00:00:45"), Timestamp.valueOf("2018-06-12 00:00:53"),
   Timestamp.valueOf("2018-06-12 00:00:00"), Timestamp.valueOf("2018-06-12 01:00:00"), "foo"),
  (288, Timestamp.valueOf("2018-06-12 00:00:47"), Timestamp.valueOf("2018-06-12 00:00:58"),
   Timestamp.valueOf("2018-06-12 00:00:00"), Timestamp.valueOf("2018-06-12 02:00:00"), "bar"),
  (301, Timestamp.valueOf("2018-06-12 03:00:15"), Timestamp.valueOf("2018-06-12 03:00:45"),
   Timestamp.valueOf("2018-06-12 00:00:00"), Timestamp.valueOf("2018-06-12 02:00:00"), "bar"),
  (302, Timestamp.valueOf("2018-06-12 00:00:15"), Timestamp.valueOf("2018-06-12 00:00:30"),
   Timestamp.valueOf("2018-06-12 00:00:00"), Timestamp.valueOf("2018-06-12 02:00:00"), "bar"),
  (302, Timestamp.valueOf("2018-06-12 01:00:20"), Timestamp.valueOf("2018-06-12 01:00:50"),
   Timestamp.valueOf("2018-06-12 00:00:00"), Timestamp.valueOf("2018-06-12 02:00:00"), "bar"),
  (303, Timestamp.valueOf("2018-06-12 01:00:30"), Timestamp.valueOf("2018-06-12 01:00:45"),
   Timestamp.valueOf("2018-06-12 02:00:00"), Timestamp.valueOf("2018-06-12 03:00:00"), "gee")
).toDF("user", "event_start", "event_end", "show_start", "show_end", "show_name")
df.show
// +----+-------------------+-------------------+-------------------+-------------------+---------+
// |user| event_start| event_end| show_start| show_end|show_name|
// +----+-------------------+-------------------+-------------------+-------------------+---------+
// | 286|2018-06-12 00:00:19|2018-06-12 00:00:48|2018-06-12 00:00:00|2018-06-12 01:00:00| foo|
// | 287|2018-06-12 00:00:45|2018-06-12 00:00:53|2018-06-12 00:00:00|2018-06-12 01:00:00| foo|
// | 288|2018-06-12 00:00:47|2018-06-12 00:00:58|2018-06-12 00:00:00|2018-06-12 02:00:00| bar|
// | 301|2018-06-12 03:00:15|2018-06-12 03:00:45|2018-06-12 00:00:00|2018-06-12 02:00:00| bar|
// | 302|2018-06-12 00:00:15|2018-06-12 00:00:30|2018-06-12 00:00:00|2018-06-12 02:00:00| bar|
// | 302|2018-06-12 01:00:20|2018-06-12 01:00:50|2018-06-12 00:00:00|2018-06-12 02:00:00| bar|
// | 303|2018-06-12 01:00:30|2018-06-12 01:00:45|2018-06-12 02:00:00|2018-06-12 03:00:00| gee|
// +----+-------------------+-------------------+-------------------+-------------------+---------+
val showList = df.select($"show_name", $"show_start", $"show_end").
distinct.collect
val showsListNew = showList.map( row => {
val distinctCount = df.select(countDistinct(when($"event_start".between(
row.getTimestamp(1), row.getTimestamp(2)
), $"user"))
).head.getLong(0)
(row.getString(0), row.getTimestamp(1), row.getTimestamp(2), distinctCount)
}
)
// showsListNew: Array[(String, java.sql.Timestamp, java.sql.Timestamp, Long)] = Array(
// (gee, 2018-06-12 02:00:00.0, 2018-06-12 03:00:00.0, 0),
// (bar, 2018-06-12 00:00:00.0, 2018-06-12 02:00:00.0, 5),
// (foo, 2018-06-12 00:00:00.0, 2018-06-12 01:00:00.0, 4)
// )
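Note that this runs one Spark aggregation job per distinct show (the head call inside the map is an action), which is why the assumption that the number of distinct shows is small matters.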
// convert the per-show counts back into a DataFrame and join it onto the original rows
val showDF = sc.parallelize(showsListNew).toDF("show_name", "show_start", "show_end", "active_users")

df.join(showDF, Seq("show_name", "show_start", "show_end")).
  show
// +---------+-------------------+-------------------+----+-------------------+-------------------+------------+
// |show_name| show_start| show_end|user| event_start| event_end|active_users|
// +---------+-------------------+-------------------+----+-------------------+-------------------+------------+
// | gee|2018-06-12 02:00:00|2018-06-12 03:00:00| 303|2018-06-12 01:00:30|2018-06-12 01:00:45| 0|
// | bar|2018-06-12 00:00:00|2018-06-12 02:00:00| 302|2018-06-12 01:00:20|2018-06-12 01:00:50| 5|
// | bar|2018-06-12 00:00:00|2018-06-12 02:00:00| 302|2018-06-12 00:00:15|2018-06-12 00:00:30| 5|
// | bar|2018-06-12 00:00:00|2018-06-12 02:00:00| 301|2018-06-12 03:00:15|2018-06-12 03:00:45| 5|
// | bar|2018-06-12 00:00:00|2018-06-12 02:00:00| 288|2018-06-12 00:00:47|2018-06-12 00:00:58| 5|
// | foo|2018-06-12 00:00:00|2018-06-12 01:00:00| 287|2018-06-12 00:00:45|2018-06-12 00:00:53| 4|
// | foo|2018-06-12 00:00:00|2018-06-12 01:00:00| 286|2018-06-12 00:00:19|2018-06-12 00:00:48| 4|
// +---------+-------------------+-------------------+----+-------------------+-------------------+------------+
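If the number of distinct shows is too large to collect, one alternative worth considering (a sketch under that assumption, not something I have benchmarked) is to compute the same per-show counts with a range join plus aggregation, keeping everything on the executors:

// distinct shows with their time windows
val shows = df.select($"show_name", $"show_start", $"show_end").distinct

// left-join every event whose event_start falls inside a show's window,
// then count distinct users per show (shows with no matching events get 0)
val counts = shows.
  join(df.select($"user", $"event_start"),
    $"event_start".between($"show_start", $"show_end"), "left").
  groupBy("show_name", "show_start", "show_end").
  agg(countDistinct($"user").as("active_users"))

df.join(counts, Seq("show_name", "show_start", "show_end")).show

This avoids the driver-side loop, although the non-equi join itself can be expensive (Spark falls back to a broadcast nested loop join for range conditions), so it is a trade-off rather than a strict improvement.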