I'm trying to figure out whether what I want to do is possible in Spark. Let's say I have a CSV that, if read as a DataFrame, looks like this:
+---------------------+-----------+-------+-------------+
| TimeStamp | Customer | User | Application |
+---------------------+-----------+-------+-------------+
| 2017-01-01 00:00:01 | customer1 | user1 | app1 |
| 2017-01-01 12:00:05 | customer1 | user1 | app1 |
| 2017-01-01 14:00:03 | customer1 | user2 | app2 |
| 2017-01-01 23:50:50 | customer1 | user1 | app1 |
| 2017-01-02 00:00:02 | customer1 | user1 | app1 |
+---------------------+-----------+-------+-------------+
I'm trying to produce a DataFrame containing, for each row, the number of unique users of a given customer that visited a particular application in the last 24 hours. So the result would look like this:
+---------------------+-----------+-------+-------------+----------------------+
| TimeStamp | Customer | User | Application | UniqueUserVisitedApp |
+---------------------+-----------+-------+-------------+----------------------+
| 2017-01-01 00:00:01 | customer1 | user1 | app1 | 0 |
| 2017-01-01 12:00:05 | customer1 | user2 | app1 | 1 |
| 2017-01-01 13:00:05 | customer1 | user2 | app1 | 2 |
| 2017-01-01 14:00:03 | customer1 | user1 | app1 | 2 |
| 2017-01-01 23:50:50 | customer1 | user3 | app1 | 2 |
| 2017-01-01 23:50:51 | customer2 | user4 | app2 | 0 |
| 2017-01-02 00:00:02 | customer1 | user1 | app1 | 3 |
+---------------------+-----------+-------+-------------+----------------------+
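To make the logic concrete, here is a plain-Python sketch (not Spark) of the per-row count I'm after; the function name and row layout are just for illustration:

```python
from datetime import datetime, timedelta

FMT = '%Y-%m-%d %H:%M:%S'

def trailing_24h_distinct_users(rows):
    """For each visit, count the distinct users of the same customer and
    application whose earlier visits fall within the preceding 24 hours."""
    # rows: iterable of (timestamp_str, customer, user, application)
    parsed = sorted((datetime.strptime(ts, FMT), cust, user, app)
                    for ts, cust, user, app in rows)
    out = []
    for i, (ts, cust, user, app) in enumerate(parsed):
        cutoff = ts - timedelta(hours=24)
        # Only visits strictly before this row, within the trailing window.
        prior_users = {u for t, c, u, a in parsed[:i]
                       if t > cutoff and c == cust and a == app}
        out.append((ts.strftime(FMT), cust, user, app, len(prior_users)))
    return out
```

This is quadratic over the whole dataset, which is exactly why I'm looking for a distributed Spark equivalent.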
So I can do a tumbling window with the code below, but that's not what we're after.
val data = spark.read.csv("path/to/csv")
val tumblingWindow = data
  .groupBy(col("Customer"), col("Application"), window(data.col("TimeStamp"), "24 hours"))
  .agg(countDistinct("User").as("UniqueUsersVisitedApp"))
The result looks like this:
+-----------+-------------+-------------------------+-----------------------+
| Customer | Application | Window | UniqueUsersVisitedApp |
+-----------+-------------+-------------------------+-----------------------+
| customer1 | app1 | [2017-01-01 00:00:00... | 2 |
| customer2 | app2 | [2017-01-01 00:00:00... | 1 |
| customer1 | app1 | [2017-01-02 00:00:00... | 1 |
+-----------+-------------+-------------------------+-----------------------+
Any help would be greatly appreciated.
Answer 0 (score: 0)
I tried this with the pyspark window functions, by creating a sub-partition for each date and applying a count over it. Not sure how efficient it is. Here's my code snippet:
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.window import Window
>>> l = [('2017-01-01 00:00:01','customer1','user1','app1'),('2017-01-01 12:00:05','customer1','user1','app1'),('2017-01-01 14:00:03','customer1','user2','app2'),('2017-01-01 23:50:50','customer1','user1','app1'),('2017-01-02 00:00:02','customer1','user1','app1'),('2017-01-02 12:00:02','customer1','user1','app1'),('2017-01-03 14:00:02','customer1','user1','app1'),('2017-01-02 00:00:02','customer1','user2','app2'),('2017-01-01 16:04:01','customer1','user1','app1'),('2017-01-01 23:59:01','customer1','user1','app1'),('2017-01-01 18:00:01','customer1','user2','app2')]
>>> df = spark.createDataFrame(l,['TimeStamp','Customer','User','Application'])
>>> df = df.withColumn('TimeStamp',df['TimeStamp'].cast('timestamp')).withColumn('Date',F.to_date(F.col('TimeStamp')))
>>> df.show()
+-------------------+---------+-----+-----------+----------+
| TimeStamp| Customer| User|Application| Date|
+-------------------+---------+-----+-----------+----------+
|2017-01-01 00:00:01|customer1|user1| app1|2017-01-01|
|2017-01-01 12:00:05|customer1|user1| app1|2017-01-01|
|2017-01-01 14:00:03|customer1|user2| app2|2017-01-01|
|2017-01-01 23:50:50|customer1|user1| app1|2017-01-01|
|2017-01-02 00:00:02|customer1|user1| app1|2017-01-02|
|2017-01-02 12:00:02|customer1|user1| app1|2017-01-02|
|2017-01-03 14:00:02|customer1|user1| app1|2017-01-03|
|2017-01-02 00:00:02|customer1|user2| app2|2017-01-02|
|2017-01-01 16:04:01|customer1|user1| app1|2017-01-01|
|2017-01-01 23:59:01|customer1|user1| app1|2017-01-01|
|2017-01-01 18:00:01|customer1|user2| app2|2017-01-01|
+-------------------+---------+-----+-----------+----------+
>>> df.printSchema()
root
|-- TimeStamp: timestamp (nullable = true)
|-- Customer: string (nullable = true)
|-- User: string (nullable = true)
|-- Application: string (nullable = true)
|-- Date: date (nullable = true)
>>> w = Window.partitionBy('Customer','User','Application','Date').orderBy('TimeStamp')
>>> diff = F.coalesce(F.datediff("TimeStamp", F.lag("TimeStamp", 1).over(w)), F.lit(0))
>>> subpartition = F.count(diff<1).over(w)
>>> df.select("*",(subpartition-1).alias('Count')).drop('Date').orderBy('Customer','User','Application','TimeStamp').show()
+-------------------+---------+-----+-----------+-----+
| TimeStamp| Customer| User|Application|Count|
+-------------------+---------+-----+-----------+-----+
|2017-01-01 00:00:01|customer1|user1| app1| 0|
|2017-01-01 12:00:05|customer1|user1| app1| 1|
|2017-01-01 16:04:01|customer1|user1| app1| 2|
|2017-01-01 23:50:50|customer1|user1| app1| 3|
|2017-01-01 23:59:01|customer1|user1| app1| 4|
|2017-01-02 00:00:02|customer1|user1| app1| 0|
|2017-01-02 12:00:02|customer1|user1| app1| 1|
|2017-01-03 14:00:02|customer1|user1| app1| 0|
|2017-01-01 14:00:03|customer1|user2| app2| 0|
|2017-01-01 18:00:01|customer1|user2| app2| 1|
|2017-01-02 00:00:02|customer1|user2| app2| 0|
+-------------------+---------+-----+-----------+-----+
Answer 1 (score: 0)
If I'm understanding your question correctly, apply a filter before doing the groupBy:
data = spark.read.csv('path/to/csv')
result = (data
.filter(data['TimeStamp'] > now_minus_24_hours)
.groupBy(["Customer", "Application", "User"])
.count())
Note that users who haven't visited in the last 24 hours will be missing from the DataFrame, rather than counted as zero.
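`now_minus_24_hours` is left undefined above; assuming the timestamps are strings in the `yyyy-MM-dd HH:mm:ss` format shown, one way to build it is:

```python
from datetime import datetime, timedelta

# Hypothetical reference time for illustration; in practice use datetime.now().
now = datetime(2017, 1, 2, 0, 0, 2)
now_minus_24_hours = (now - timedelta(hours=24)).strftime('%Y-%m-%d %H:%M:%S')
print(now_minus_24_hours)  # 2017-01-01 00:00:02
```

String comparison works here because this timestamp format sorts lexicographically in chronological order.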
If you're trying to get the number of visits in the last 24 hours relative to every timestamp, you could do something similar to my answer here. The basic steps would be:

1. reduceByKey to get a list of timestamps for each user/app/customer combination (the same as in the other example). Each row will now be of the form:

((user, app, customer), list_of_timestamps)

2. Process each list of timestamps to generate a list of "number of visits in the previous 24 hours" for each timestamp. The data will then be of the form:

((user, app, customer), [(ts_0, num_visits_24hr_before_ts_0), (ts_1, num_visits_24hr_before_ts_1), ...])

3. flatMap each row back into multiple rows using something like:

lambda row: [(*row[0], *ts_num_visits) for ts_num_visits in row[1]]
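The three steps above can be sketched in plain Python, substituting an in-memory dict for reduceByKey and a loop for flatMap; the function and variable names are illustrative only:

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import datetime, timedelta

FMT = '%Y-%m-%d %H:%M:%S'

def visits_in_prior_24h(rows):
    # rows: iterable of (timestamp_str, customer, user, application)
    # Step 1: group timestamps per (user, app, customer) key.
    groups = defaultdict(list)
    for ts, customer, user, app in rows:
        groups[(user, app, customer)].append(datetime.strptime(ts, FMT))
    out = []
    for key, ts_list in groups.items():
        ts_list.sort()
        # Step 2: for each timestamp, count earlier visits in the prior 24h
        # (bisect finds the oldest timestamp still inside the window).
        # Step 3: flatten back to one output row per original visit.
        for i, ts in enumerate(ts_list):
            lo = bisect_left(ts_list, ts - timedelta(hours=24))
            out.append((*key, ts.strftime(FMT), i - lo))
    return out
```

Sorting each key's timestamps once and using binary search keeps step 2 at O(n log n) per key instead of comparing every pair.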