I want to calculate the time spent per SeqID for each user. I have a DataFrame like this.
However, for each user, the time is split across pairs of two actions, Action_A and Action_B.
The total time per user and per SeqID should be the sum over all such pairs.
For the first user it is 5 + 3 [(2019-12-10 10:05:00 - 2019-12-10 10:00:00) + (2019-12-10 10:23:00 - 2019-12-10 10:20:00)].
So, ideally, the first user has spent 8 mins on SeqID 15 (and not 23 mins).
Similarly, user 2 has spent 1 + 5 = 6 mins.
How can I compute this using PySpark?
data = [("ID1", 15, "2019-12-10 10:00:00", "Action_A"),
        ("ID1", 15, "2019-12-10 10:05:00", "Action_B"),
        ("ID1", 15, "2019-12-10 10:20:00", "Action_A"),
        ("ID1", 15, "2019-12-10 10:23:00", "Action_B"),
        ("ID2", 23, "2019-12-10 11:10:00", "Action_A"),
        ("ID2", 23, "2019-12-10 11:11:00", "Action_B"),
        ("ID2", 23, "2019-12-10 11:30:00", "Action_A"),
        ("ID2", 23, "2019-12-10 11:35:00", "Action_B")]
df = spark.createDataFrame(data, ["ID", "SeqID", "Timestamp", "Action"])
df.show()
+---+-----+-------------------+--------+
| ID|SeqID| Timestamp| Action|
+---+-----+-------------------+--------+
|ID1| 15|2019-12-10 10:00:00|Action_A|
|ID1| 15|2019-12-10 10:05:00|Action_B|
|ID1| 15|2019-12-10 10:20:00|Action_A|
|ID1| 15|2019-12-10 10:23:00|Action_B|
|ID2| 23|2019-12-10 11:10:00|Action_A|
|ID2| 23|2019-12-10 11:11:00|Action_B|
|ID2| 23|2019-12-10 11:30:00|Action_A|
|ID2| 23|2019-12-10 11:35:00|Action_B|
+---+-----+-------------------+--------+
Once I have the duration for every pair, I can sum them over the whole group (ID, SeqID).
Expected output (could also be in seconds):
+---+-----+--------+
| ID|SeqID|Dur_Mins|
+---+-----+--------+
|ID1| 15| 8|
|ID2| 23| 6|
+---+-----+--------+
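
For reference, the expected numbers can be sanity-checked in plain Python before reaching for Spark (a quick sketch using the timestamps above):

from datetime import datetime

# (Action_A, Action_B) timestamp pairs for user ID1, SeqID 15.
fmt = "%Y-%m-%d %H:%M:%S"
pairs = [("2019-12-10 10:00:00", "2019-12-10 10:05:00"),
         ("2019-12-10 10:20:00", "2019-12-10 10:23:00")]
total_mins = sum((datetime.strptime(b, fmt) - datetime.strptime(a, fmt)).total_seconds() / 60
                 for a, b in pairs)
print(total_mins)  # 8.0, the expected Dur_Mins for ID1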
Answer 0 (score: 2)
Here is a possible solution using higher-order functions (Spark >= 2.4):
from pyspark.sql.functions import array_sort, col, collect_list, expr

# Diff (in minutes) between each timestamp and the next one; the (i+1)%2
# factor keeps only the diffs of the 1st-2nd, 3rd-4th, ... pairs.
transform_expr = "transform(ts_array, (x,i) -> (unix_timestamp(ts_array[i+1]) - unix_timestamp(x))/60 * ((i+1)%2))"

df.groupBy("ID", "SeqID").agg(array_sort(collect_list(col("Timestamp"))).alias("ts_array")) \
    .withColumn("transformed_ts_array", expr(transform_expr)) \
    .withColumn("Dur_Mins", expr("aggregate(transformed_ts_array, 0D, (acc, x) -> acc + coalesce(x, 0D))")) \
    .drop("transformed_ts_array", "ts_array") \
    .show(truncate=False)
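
If it helps, the intermediate columns can be inspected before they are dropped (a sketch reusing the same df and transform_expr as above, not part of the original answer):

# Show the sorted timestamp array and the per-pair minute diffs.
# For ID1/SeqID 15, ts_array sorts to [10:00, 10:05, 10:20, 10:23] and
# transformed_ts_array becomes [5.0, 0.0, 3.0, null]; the trailing null
# (there is no next element) is why the aggregate uses coalesce(x, 0D).
df.groupBy("ID", "SeqID") \
    .agg(array_sort(collect_list(col("Timestamp"))).alias("ts_array")) \
    .withColumn("transformed_ts_array", expr(transform_expr)) \
    .show(truncate=False)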
Steps:

1. Group by ID and SeqID, collect all the timestamps into an array, and sort it in ascending order.
2. Apply a transform with a lambda (x, i) => Double to the array, where x is the actual element and i is its index. For each timestamp in the array, we compute the difference with the next timestamp, and multiply it by (i+1)%2 so that we only keep the diff for every pair of 2 (first with second, third with fourth, ...), as there are always 2 actions.

Output:
+---+-----+--------+
|ID |SeqID|Dur_Mins|
+---+-----+--------+
|ID1|15 |8.0 |
|ID2|23 |6.0 |
+---+-----+--------+
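
As a side note, the (i+1)%2 masking trick can be seen in isolation with a minimal transform call (an illustrative sketch, not from the original answer):

# Elements at even indices are multiplied by 1 and kept; odd indices by 0.
spark.sql("SELECT transform(array(10, 20, 30, 40), (x, i) -> x * ((i+1)%2)) AS masked") \
    .show(truncate=False)  # masked = [10, 0, 30, 0]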
Answer 1 (score: 1)
One possible way (which might also be complicated) using flatMapValues and rdd.

Using your data variable:
import pyspark.sql.functions as func

df = spark.createDataFrame(data, ["id", "seq_id", "ts", "action"]). \
    withColumn('ts', func.col('ts').cast('timestamp'))
# func to calculate the duration | applied on each group of sorted rows
def getDur(groupedrows):
    """Return every row with an appended duration (in seconds):
    0.0 for an Action_A row, and the elapsed time since the
    preceding Action_A for an Action_B row."""
    res = []
    for row in groupedrows:
        if row.action == 'Action_A':
            frst_ts = row.ts
            dur = 0
        elif row.action == 'Action_B':
            dur = (row.ts - frst_ts).total_seconds()
        res.append([val for val in row] + [float(dur)])
    return res
# run the rules on the base df | row by row
# grouped on ID, SeqID - sorted on timestamp
dur_rdd = df.rdd. \
    groupBy(lambda k: (k.id, k.seq_id)). \
    flatMapValues(lambda r: getDur(sorted(r, key=lambda ok: ok.ts))). \
    values()
# specify final schema
dur_schema = df.schema. \
    add('dur', 'float')
# convert to DataFrame
dur_sdf = spark.createDataFrame(dur_rdd, dur_schema)
dur_sdf.orderBy('id', 'seq_id', 'ts').show()
+---+------+-------------------+--------+-----+
| id|seq_id| ts| action| dur|
+---+------+-------------------+--------+-----+
|ID1| 15|2019-12-10 10:00:00|Action_A| 0.0|
|ID1| 15|2019-12-10 10:05:00|Action_B|300.0|
|ID1| 15|2019-12-10 10:20:00|Action_A| 0.0|
|ID1| 15|2019-12-10 10:23:00|Action_B|180.0|
|ID2| 23|2019-12-10 11:10:00|Action_A| 0.0|
|ID2| 23|2019-12-10 11:11:00|Action_B| 60.0|
|ID2| 23|2019-12-10 11:30:00|Action_A| 0.0|
|ID2| 23|2019-12-10 11:35:00|Action_B|300.0|
+---+------+-------------------+--------+-----+
# Your required data
dur_sdf.groupBy('id', 'seq_id'). \
    agg((func.sum('dur') / func.lit(60)).alias('dur_mins')). \
    show()
+---+------+--------+
| id|seq_id|dur_mins|
+---+------+--------+
|ID1| 15| 8.0|
|ID2| 23| 6.0|
+---+------+--------+
This works with the data you described, but please check whether it fits all of your cases.
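
For completeness, a window-function variant is also possible (a sketch not taken from either answer, using the original df from the question with columns ID, SeqID, Timestamp, Action, and assuming Action_A and Action_B always strictly alternate within each ID/SeqID): pair every Action_B row with the preceding row via lag, then sum per group.

from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy("ID", "SeqID").orderBy("ts")

df.withColumn("ts", F.col("Timestamp").cast("timestamp")) \
    .withColumn("prev_ts", F.lag("ts").over(w)) \
    .where(F.col("Action") == "Action_B") \
    .withColumn("dur_mins", (F.col("ts").cast("long") - F.col("prev_ts").cast("long")) / 60) \
    .groupBy("ID", "SeqID") \
    .agg(F.sum("dur_mins").alias("Dur_Mins")) \
    .show()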