Calculating the time spent per user per SeqID in PySpark

Posted: 2019-12-10 10:51:34

Tags: apache-spark pyspark apache-spark-sql pyspark-sql

I want to calculate the time spent per SeqID for every user. I have a dataframe like the one below. However, for every user the time is split between two actions, Action_A and Action_B; the total time per user and per SeqID will be the sum over all such pairs.

For the first user, it is 5 + 3 [(2019-12-10 10:05:00 - 2019-12-10 10:00:00) + (2019-12-10 10:23:00 - 2019-12-10 10:20:00)]

So, ideally, the first user has spent 8 mins on SeqID 15 (and not 23 mins).

Similarly, user 2 has spent 1 + 5 = 6 mins.

How can I compute this using PySpark?

data = [(("ID1", 15, "2019-12-10 10:00:00", "Action_A")), 
        (("ID1", 15, "2019-12-10 10:05:00", "Action_B")),
        (("ID1", 15, "2019-12-10 10:20:00", "Action_A")),
        (("ID1", 15, "2019-12-10 10:23:00", "Action_B")),
        (("ID2", 23, "2019-12-10 11:10:00", "Action_A")),
        (("ID2", 23, "2019-12-10 11:11:00", "Action_B")),
        (("ID2", 23, "2019-12-10 11:30:00", "Action_A")),
        (("ID2", 23, "2019-12-10 11:35:00", "Action_B"))]
df = spark.createDataFrame(data, ["ID", "SeqID", "Timestamp", "Action"])
df.show()

+---+-----+-------------------+--------+
| ID|SeqID|          Timestamp|  Action|
+---+-----+-------------------+--------+
|ID1|   15|2019-12-10 10:00:00|Action_A|
|ID1|   15|2019-12-10 10:05:00|Action_B|
|ID1|   15|2019-12-10 10:20:00|Action_A|
|ID1|   15|2019-12-10 10:23:00|Action_B|
|ID2|   23|2019-12-10 11:10:00|Action_A|
|ID2|   23|2019-12-10 11:11:00|Action_B|
|ID2|   23|2019-12-10 11:30:00|Action_A|
|ID2|   23|2019-12-10 11:35:00|Action_B|
+---+-----+-------------------+--------+

Once I have the duration for every pair, I can sum them over the whole group (ID, SeqID).

Expected output (it could also be in seconds):

+---+-----+--------+
| ID|SeqID|Dur_Mins|
+---+-----+--------+
|ID1|   15|       8|
|ID2|   23|       6|
+---+-----+--------+

2 Answers:

Answer 0: (score: 2)

Here is a possible solution using Higher-Order Functions (Spark >= 2.4):

from pyspark.sql.functions import array_sort, col, collect_list, expr

transform_expr = "transform(ts_array, (x, i) -> (unix_timestamp(ts_array[i+1]) - unix_timestamp(x)) / 60 * ((i+1) % 2))"

df.groupBy("ID", "SeqID").agg(array_sort(collect_list(col("Timestamp"))).alias("ts_array")) \
    .withColumn("transformed_ts_array", expr(transform_expr)) \
    .withColumn("Dur_Mins", expr("aggregate(transformed_ts_array, 0D, (acc, x) -> acc + coalesce(x, 0D))")) \
    .drop("transformed_ts_array", "ts_array") \
    .show(truncate=False)

Steps:

  1. Collect all the timestamps into an array for each group ID, SeqID, and sort them in ascending order.
  2. Apply a transform to the array with the lambda function (x, i) => Double, where x is the actual element and i is its index. For each timestamp in the array we compute the difference with the next timestamp, and multiply it by (i+1)%2 so that only the difference of every other pair is kept (first with second, third with fourth, ...), since there are always 2 actions per pair (see the sketch after this list).
  3. Finally, we aggregate the transformed array to sum all its elements.
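
To make the masking in step 2 concrete, here is a minimal inspection sketch (my own addition, reusing df and transform_expr from above) that keeps the intermediate columns instead of dropping them:

debug_df = df.groupBy("ID", "SeqID") \
    .agg(array_sort(collect_list(col("Timestamp"))).alias("ts_array")) \
    .withColumn("transformed_ts_array", expr(transform_expr))

# For ID1 / SeqID 15 the transformed array should be [5.0, 0.0, 3.0, null]:
# only the Action_A -> Action_B gaps survive, the gaps between pairs are
# zeroed out by (i+1) % 2, and the last element is null (no next timestamp).
debug_df.select("ID", "SeqID", "transformed_ts_array").show(truncate=False)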

Output of the full query:

+---+-----+--------+
|ID |SeqID|Dur_Mins|
+---+-----+--------+
|ID1|15   |8.0     |
|ID2|23   |6.0     |
+---+-----+--------+
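
Note that Dur_Mins comes out as a double here. If the integer minutes shown in the expected output are needed, the column can simply be cast before the helper columns are dropped; a small sketch of that adjustment (not part of the original answer), reusing transform_expr from above:

df.groupBy("ID", "SeqID").agg(array_sort(collect_list(col("Timestamp"))).alias("ts_array")) \
    .withColumn("transformed_ts_array", expr(transform_expr)) \
    .withColumn("Dur_Mins", expr("aggregate(transformed_ts_array, 0D, (acc, x) -> acc + coalesce(x, 0D))")) \
    .withColumn("Dur_Mins", col("Dur_Mins").cast("int")) \
    .drop("transformed_ts_array", "ts_array") \
    .show(truncate=False)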

Answer 1: (score: 1)

A possible (and probably also complicated) approach using flatMapValues on the rdd.

Using your data variable:

from pyspark.sql import functions as func

df = spark.createDataFrame(data, ["id", "seq_id", "ts", "action"]). \
    withColumn('ts', func.col('ts').cast('timestamp'))

# func to calculate the duration | applied per group of rows
def getDur(groupedrows):
    """
    Walk one group's rows (already sorted by timestamp) and, for every
    Action_B, compute the seconds elapsed since the preceding Action_A.
    Returns each row's values with the duration appended.
    """

    res = []

    for row in groupedrows:
        if row.action == 'Action_A':
            frst_ts = row.ts
            dur = 0
        elif row.action == 'Action_B':
            dur = (row.ts - frst_ts).total_seconds()

        res.append([val for val in row] + [float(dur)])

    return res
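
# A quick local sanity check of getDur (my own sketch, not part of the original
# answer): feed it one group's rows, already sorted by timestamp, and inspect
# the appended durations. The sample rows below are made up for illustration.
from datetime import datetime
from pyspark.sql import Row

sample_rows = [
    Row(id="ID1", seq_id=15, ts=datetime(2019, 12, 10, 10, 0, 0), action="Action_A"),
    Row(id="ID1", seq_id=15, ts=datetime(2019, 12, 10, 10, 5, 0), action="Action_B"),
]

# Expected: the Action_A row gets dur 0.0 and the Action_B row gets 300.0 seconds
for out_row in getDur(sample_rows):
    print(out_row)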

# run the rules on the base df | row by row
# grouped on ID, SeqID - sorted on timestamp
dur_rdd = df.rdd. \
    groupBy(lambda k: (k.id, k.seq_id)). \
    flatMapValues(lambda r: getDur(sorted(r, key=lambda ok: ok.ts))). \
    values()

# specify final schema
dur_schema = df.schema. \
    add('dur', 'float')

# convert to DataFrame
dur_sdf = spark.createDataFrame(dur_rdd, dur_schema)

dur_sdf.orderBy('id', 'seq_id', 'ts').show()

+---+------+-------------------+--------+-----+
| id|seq_id|                 ts|  action|  dur|
+---+------+-------------------+--------+-----+
|ID1|    15|2019-12-10 10:00:00|Action_A|  0.0|
|ID1|    15|2019-12-10 10:05:00|Action_B|300.0|
|ID1|    15|2019-12-10 10:20:00|Action_A|  0.0|
|ID1|    15|2019-12-10 10:23:00|Action_B|180.0|
|ID2|    23|2019-12-10 11:10:00|Action_A|  0.0|
|ID2|    23|2019-12-10 11:11:00|Action_B| 60.0|
|ID2|    23|2019-12-10 11:30:00|Action_A|  0.0|
|ID2|    23|2019-12-10 11:35:00|Action_B|300.0|
+---+------+-------------------+--------+-----+

# Your required data
dur_sdf.groupBy('id', 'seq_id'). \
    agg((func.sum('dur') / func.lit(60)).alias('dur_mins')). \
    show()

+---+------+--------+
| id|seq_id|dur_mins|
+---+------+--------+
|ID1|    15|     8.0|
|ID2|    23|     6.0|
+---+------+--------+

This works with the data you described, but please check whether it holds for all your cases.
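
One such case: a group whose first action is Action_B would make getDur fail, because frst_ts has not been set yet. A more defensive variant (my own adjustment under that assumption, not part of the answer above) could skip unmatched Action_B rows instead:

def getDurSafe(groupedrows):
    """Like getDur, but ignores an Action_B with no preceding Action_A."""
    res = []
    frst_ts = None
    for row in groupedrows:
        dur = 0.0
        if row.action == 'Action_A':
            frst_ts = row.ts
        elif row.action == 'Action_B' and frst_ts is not None:
            dur = (row.ts - frst_ts).total_seconds()
            frst_ts = None  # consume the pair so a repeated Action_B contributes nothing
        res.append([val for val in row] + [float(dur)])
    return res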