Is there a "stateful map by key" in Spark?

Asked: 2018-06-19 04:59:14

Tags: apache-spark

I have an RDD (or DataFrame) of measurement data, ordered by timestamp, and I need to perform a pairwise operation on two consecutive records with the same key (for example, a trapezoidal integration of accelerometer data to obtain velocities).

Is there a function in Spark that "remembers" the last record for each key and makes it available when the next record with the same key arrives?

The approach I currently have in mind is:

  1. Get all keys of the RDD
  2. Partition the RDD by those keys with a custom Partitioner, so that there is one partition per key
  3. Use mapPartitions to do the computation (a rough sketch of this idea follows the list)
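
A rough sketch of that idea, for illustration only: the Reading case class and the readings RDD are hypothetical names, collecting the distinct keys is exactly the expensive step discussed below, and each key's data is assumed to fit in memory on one executor.

import org.apache.spark.Partitioner

// Hypothetical record type for illustration
case class Reading(key: Int, ts: Long, accel: Double)

// One partition per key, built from the collected key set
class KeyPartitioner(keys: Array[Int]) extends Partitioner {
  private val index = keys.zipWithIndex.toMap
  override def numPartitions: Int = keys.length
  override def getPartition(k: Any): Int = index(k.asInstanceOf[Int])
}

val keys = readings.map(_.key).distinct().collect()   // the expensive "get all keys" step
val integrated = readings
  .map(r => (r.key, r))
  .partitionBy(new KeyPartitioner(keys))
  .mapPartitions { it =>
    // all records in a partition share one key; materialize, sort by time, walk consecutive pairs
    val sorted = it.map(_._2).toSeq.sortBy(_.ts)
    sorted.sliding(2).collect { case Seq(prev, curr) =>
      // trapezoidal step: average acceleration times elapsed time
      (curr.key, curr.ts, (prev.accel + curr.accel) / 2.0 * (curr.ts - prev.ts))
    }
  }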
However, this approach has a flaw:

First of all, collecting the keys can be a very long-running task, because the input data may be several GiB or even TiB in size. I could write a custom InputFormat to extract the keys much faster (I read the data through Hadoop's API and sc.newAPIHadoopFile), but that would be one more thing to take care of and one more potential source of errors.

So my question is: is there something like reduceByKey that does not aggregate the data, but instead gives me the current record together with the previous record for that key, and lets me emit one or more output records based on that information?

1 answer:

Answer 0 (score: 1)

You can do this with a DataFrame:

import java.sql.Timestamp
import org.apache.spark.sql.types.{TimestampType, IntegerType}
import org.apache.spark.sql.functions._

**Create a window for the lag function**
val w = org.apache.spark.sql.expressions.Window.partitionBy("key").orderBy("timestamp")

val df = spark.sparkContext.parallelize(List((1, 23, Timestamp.valueOf("2017-12-02 03:04:00")),
(1, 24, Timestamp.valueOf("2017-12-02 01:45:20")),
(1, 26, Timestamp.valueOf("2017-12-02 01:45:20")),
(1, 27, Timestamp.valueOf("2017-12-02 01:45:20")),
(2, 30, Timestamp.valueOf("2017-12-02 01:45:20")),
(2, 33, Timestamp.valueOf("2017-12-02 01:45:20")),
(2, 39, Timestamp.valueOf("2017-12-02 01:45:20")))).toDF("key","value","timestamp")

scala> df.printSchema
root
 |-- key: integer (nullable = false)
 |-- value: integer (nullable = false)
 |-- timestamp: timestamp (nullable = true)


scala> val lagDF = df.withColumn("lag_value",lag("value", 1, 0).over(w))
lagDF: org.apache.spark.sql.DataFrame = [key: int, value: int ... 2 more fields]

**The previous record and the current record are now in the same row**
scala> lagDF.show(10, false)
+---+-----+-------------------+---------+
|key|value|timestamp          |lag_value|
+---+-----+-------------------+---------+
|1  |24   |2017-12-02 01:45:20|0        |
|1  |26   |2017-12-02 01:45:20|24       |
|1  |27   |2017-12-02 01:45:20|26       |
|1  |23   |2017-12-02 03:04:00|27       |
|2  |30   |2017-12-02 01:45:20|0        |
|2  |33   |2017-12-02 01:45:20|30       |
|2  |39   |2017-12-02 01:45:20|33       |
+---+-----+-------------------+---------+

**Put your distance calculation logic here. I'm using a dummy operation for the demo**
scala> val result = lagDF.withColumn("dummy_operation_for_dist_calc", lagDF("value") - lagDF("lag_value"))
result: org.apache.spark.sql.DataFrame = [key: int, value: int ... 3 more fields]

scala> result.show(10, false)
+---+-----+-------------------+---------+-----------------------------+
|key|value|timestamp          |lag_value|dummy_operation_for_dist_calc|
+---+-----+-------------------+---------+-----------------------------+
|1  |24   |2017-12-02 01:45:20|0        |24                           |
|1  |26   |2017-12-02 01:45:20|24       |2                            |
|1  |27   |2017-12-02 01:45:20|26       |1                            |
|1  |23   |2017-12-02 03:04:00|27       |-4                           |
|2  |30   |2017-12-02 01:45:20|0        |30                           |
|2  |33   |2017-12-02 01:45:20|30       |3                            |
|2  |39   |2017-12-02 01:45:20|33       |6                            |
+---+-----+-------------------+---------+-----------------------------+
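
To connect this back to the trapezoidal integration from the question: the same lag trick can be applied to the timestamp as well, and a running sum over a similar window then gives the velocity. A minimal sketch, assuming df and w are the DataFrame and window defined above; column names such as delta_v and velocity are illustrative, and unix_timestamp truncates to whole seconds.

import org.apache.spark.sql.expressions.Window

// Bring the previous value and the previous timestamp onto the current row
val lagged = df
  .withColumn("prev_value", lag("value", 1).over(w))
  .withColumn("prev_ts", lag("timestamp", 1).over(w))

// Trapezoidal step: (prev + curr) / 2 * dt; falls back to 0.0 on the first record of each key (lag is null there)
val steps = lagged.withColumn("delta_v",
  coalesce((col("prev_value") + col("value")) / 2.0 *
    (unix_timestamp(col("timestamp")) - unix_timestamp(col("prev_ts"))), lit(0.0)))

// Running sum of the increments per key, ordered by time, gives the velocity at each timestamp
val wCum = Window.partitionBy("key").orderBy("timestamp")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
val velocity = steps.withColumn("velocity", sum("delta_v").over(wCum))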