Count rows in a DataFrame whose column value lies between two column values of the current row, and add the count as a column

Asked: 2018-06-22 03:49:05

Tags: scala apache-spark apache-spark-sql

Suppose I have a table with the following columns:

| user| event_start         | event_end           | show_start          | show_end            | show_name  |
------------------------------------------------------------------------------------------------------------
| 286 | 2018-06-12 00:00:19 | 2018-06-12 00:00:48 | 2018-06-12 00:00:00 | 2018-06-12 01:00:00 | foo        |
| 287 | 2018-06-12 00:00:45 | 2018-06-12 00:00:53 | 2018-06-12 00:00:00 | 2018-06-12 01:00:00 | foo        |
| 288 | 2018-06-12 00:00:47 | 2018-06-12 00:00:58 | 2018-06-12 00:00:00 | 2018-06-12 02:00:00 | bar        |
...

How can I add a new column containing the number of distinct users in the table whose event_start value falls between this row's show_start and show_end?

The resulting table would look like this:

| user| event_start         | event_end           | show_start          | show_end            | show_name  | active_users |
---------------------------------------------------------------------------------------------------------------------------
| 286 | 2018-06-12 00:00:19 | 2018-06-12 00:00:48 | 2018-06-12 00:00:00 | 2018-06-12 01:00:00 | foo        | 18           |
| 287 | 2018-06-12 00:00:45 | 2018-06-12 00:00:53 | 2018-06-12 00:00:00 | 2018-06-12 01:00:00 | foo        | 18           |
| 288 | 2018-06-12 00:00:47 | 2018-06-12 00:00:58 | 2018-06-12 00:00:00 | 2018-06-12 02:00:00 | bar        | 31           |
...

This will be used to compute, for each show, the proportion of users watching that show relative to the active users.

My gut feeling is that I probably need a window function, but I haven't yet been able to work out how to construct the required operation.

1 Answer:

Answer 0 (score: 2):

Based on the requirement clarified in the comment section, it appears that for each distinct show a full-DataFrame lookup of the active users is needed. That could be expensive, especially if there are many distinct shows. Assuming the number of distinct shows is not too large (i.e. small enough to be collected to the driver), here's one approach:

// Assumes a spark-shell (or notebook) session, where spark.implicits._ (for $ and toDF) and sc are already in scope
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import java.sql.Timestamp

val df = Seq(
  (286, Timestamp.valueOf("2018-06-12 00:00:19"), Timestamp.valueOf("2018-06-12 00:00:48"),
        Timestamp.valueOf("2018-06-12 00:00:00"), Timestamp.valueOf("2018-06-12 01:00:00"), "foo"),
  (287, Timestamp.valueOf("2018-06-12 00:00:45"), Timestamp.valueOf("2018-06-12 00:00:53"),
        Timestamp.valueOf("2018-06-12 00:00:00"), Timestamp.valueOf("2018-06-12 01:00:00"), "foo"),
  (288, Timestamp.valueOf("2018-06-12 00:00:47"), Timestamp.valueOf("2018-06-12 00:00:58"),
        Timestamp.valueOf("2018-06-12 00:00:00"), Timestamp.valueOf("2018-06-12 02:00:00"), "bar"),
  (301, Timestamp.valueOf("2018-06-12 03:00:15"), Timestamp.valueOf("2018-06-12 03:00:45"),
        Timestamp.valueOf("2018-06-12 00:00:00"), Timestamp.valueOf("2018-06-12 02:00:00"), "bar"),
  (302, Timestamp.valueOf("2018-06-12 00:00:15"), Timestamp.valueOf("2018-06-12 00:00:30"),
        Timestamp.valueOf("2018-06-12 00:00:00"), Timestamp.valueOf("2018-06-12 02:00:00"), "bar"),
  (302, Timestamp.valueOf("2018-06-12 01:00:20"), Timestamp.valueOf("2018-06-12 01:00:50"),
        Timestamp.valueOf("2018-06-12 00:00:00"), Timestamp.valueOf("2018-06-12 02:00:00"), "bar"),
  (303, Timestamp.valueOf("2018-06-12 01:00:30"), Timestamp.valueOf("2018-06-12 01:00:45"),
        Timestamp.valueOf("2018-06-12 02:00:00"), Timestamp.valueOf("2018-06-12 03:00:00"), "gee")
).toDF("user", "event_start", "event_end", "show_start", "show_end", "show_name")

df.show
// +----+-------------------+-------------------+-------------------+-------------------+---------+
// |user|        event_start|          event_end|         show_start|           show_end|show_name|
// +----+-------------------+-------------------+-------------------+-------------------+---------+
// | 286|2018-06-12 00:00:19|2018-06-12 00:00:48|2018-06-12 00:00:00|2018-06-12 01:00:00|      foo|
// | 287|2018-06-12 00:00:45|2018-06-12 00:00:53|2018-06-12 00:00:00|2018-06-12 01:00:00|      foo|
// | 288|2018-06-12 00:00:47|2018-06-12 00:00:58|2018-06-12 00:00:00|2018-06-12 02:00:00|      bar|
// | 301|2018-06-12 03:00:15|2018-06-12 03:00:45|2018-06-12 00:00:00|2018-06-12 02:00:00|      bar|
// | 302|2018-06-12 00:00:15|2018-06-12 00:00:30|2018-06-12 00:00:00|2018-06-12 02:00:00|      bar|
// | 302|2018-06-12 01:00:20|2018-06-12 01:00:50|2018-06-12 00:00:00|2018-06-12 02:00:00|      bar|
// | 303|2018-06-12 01:00:30|2018-06-12 01:00:45|2018-06-12 02:00:00|2018-06-12 03:00:00|      gee|
// +----+-------------------+-------------------+-------------------+-------------------+---------+

// Collect the distinct (show_name, show_start, show_end) triples to the driver
val showList = df.select($"show_name", $"show_start", $"show_end").
  distinct.collect

// For each show, count the distinct users whose event_start falls within that show's time window
val showsListNew = showList.map( row => {
    val distinctCount = df.select(countDistinct(when($"event_start".between(
        row.getTimestamp(1), row.getTimestamp(2)
      ), $"user"))
    ).head.getLong(0)

    (row.getString(0), row.getTimestamp(1), row.getTimestamp(2), distinctCount)
  }
)
// showsListNew: Array[(String, java.sql.Timestamp, java.sql.Timestamp, Long)] = Array(
//   (gee, 2018-06-12 02:00:00.0, 2018-06-12 03:00:00.0, 0),
//   (bar, 2018-06-12 00:00:00.0, 2018-06-12 02:00:00.0, 5),
//   (foo, 2018-06-12 00:00:00.0, 2018-06-12 01:00:00.0, 4)
// )

// Turn the per-show counts into a DataFrame so they can be joined back onto the original rows
val showDF = sc.parallelize(showsListNew).toDF("show_name", "show_start", "show_end", "active_users")

// Join the per-show active-user counts back to the original DataFrame
df.join(showDF, Seq("show_name", "show_start", "show_end")).
  show
// +---------+-------------------+-------------------+----+-------------------+-------------------+------------+
// |show_name|         show_start|           show_end|user|        event_start|          event_end|active_users|
// +---------+-------------------+-------------------+----+-------------------+-------------------+------------+
// |      gee|2018-06-12 02:00:00|2018-06-12 03:00:00| 303|2018-06-12 01:00:30|2018-06-12 01:00:45|           0|
// |      bar|2018-06-12 00:00:00|2018-06-12 02:00:00| 302|2018-06-12 01:00:20|2018-06-12 01:00:50|           5|
// |      bar|2018-06-12 00:00:00|2018-06-12 02:00:00| 302|2018-06-12 00:00:15|2018-06-12 00:00:30|           5|
// |      bar|2018-06-12 00:00:00|2018-06-12 02:00:00| 301|2018-06-12 03:00:15|2018-06-12 03:00:45|           5|
// |      bar|2018-06-12 00:00:00|2018-06-12 02:00:00| 288|2018-06-12 00:00:47|2018-06-12 00:00:58|           5|
// |      foo|2018-06-12 00:00:00|2018-06-12 01:00:00| 287|2018-06-12 00:00:45|2018-06-12 00:00:53|           4|
// |      foo|2018-06-12 00:00:00|2018-06-12 01:00:00| 286|2018-06-12 00:00:19|2018-06-12 00:00:48|           4|
// +---------+-------------------+-------------------+----+-------------------+-------------------+------------+
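As a follow-up, the question mentions that these counts are meant to feed a per-show watch ratio. Below is a minimal sketch (not part of the original answer) of how that could look, assuming "users watching a show" means the distinct users appearing on that show's rows; the viewers and watch_ratio column names are made up for illustration:

// Hypothetical follow-up: ratio of each show's distinct viewers to its active_users.
// watch_ratio comes out NULL when active_users is 0 (e.g. "gee"), since Spark's default
// (non-ANSI) behaviour is to return NULL on division by zero.
val joined = df.join(showDF, Seq("show_name", "show_start", "show_end"))

joined.
  groupBy($"show_name", $"active_users").
  agg(countDistinct($"user").as("viewers")).
  withColumn("watch_ratio", $"viewers" / $"active_users").
  show

If the number of distinct shows is too large to collect to the driver, one possible alternative (again, not from the original answer) is to keep everything on the cluster with a non-equi join followed by an aggregation:

// Hypothetical alternative: compute active_users per show with a range join instead of a driver-side loop
val shows = df.select($"show_name", $"show_start", $"show_end").distinct

val activeCounts = shows.as("s").
  join(df.as("e"), $"e.event_start".between($"s.show_start", $"s.show_end"), "left").
  groupBy($"s.show_name", $"s.show_start", $"s.show_end").
  agg(countDistinct($"e.user").as("active_users"))  // countDistinct skips NULLs, so shows with no events get 0

df.join(activeCounts, Seq("show_name", "show_start", "show_end")).show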