I have an RDD (or DataFrame) of measurement data, sorted by timestamp, and I need to do a pairwise operation on two subsequent records of the same key (for example, a trapezoidal integration of accelerometer data to obtain velocities).
Is there a function in Spark that "remembers" the last record for each key and makes it available when the next record with the same key arrives?
What I currently have in mind is this: partition the RDD by the keys found, using a custom Partitioner, so that I know there is a whole partition for each key, and then use mapPartitions to do the calculation.
However, this has a flaw:
First of all, getting the keys can be a very long task, because the input data may be several GiB or even TiB in size. I could write a custom InputFormat
to extract the keys significantly faster (I use Hadoop's API with sc.newAPIHadoopFile
to read the data), but that would be one more thing to consider and another source of errors.
So my question is: is there something like reduceByKey
that does not aggregate the data, but just gives me the current record together with the previous record for that key, and lets me output one or more records based on them?
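For illustration, here is a rough sketch of the mapPartitions idea described above. Everything in it is hypothetical: the composite (key, timestamp) record key, the KeyOnlyPartitioner, and the fixed partition count are made up for the example.

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Partitioner that hashes only the key part of the (key, timestamp) composite,
// so all records of one key end up in the same partition.
class KeyOnlyPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(composite: Any): Int = composite match {
    case (key, _) => math.abs(key.hashCode) % numPartitions
  }
}

// hypothetical input: ((key, timestamp), acceleration)
val records: RDD[((Int, Long), Double)] = ???

val deltas = records
  // shuffle by key only, but sort each partition by (key, timestamp)
  .repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(16))
  .mapPartitions { it =>
    // sliding(2) pairs every record with its predecessor in the partition
    it.sliding(2).collect {
      case Seq(((k1, t1), a1), ((k2, t2), a2)) if k1 == k2 =>
        // one trapezoidal step: (a1 + a2) / 2 * dt
        (k2, (a1 + a2) / 2.0 * (t2 - t1))
    }
  }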
Answer 0 (score: 1)
You can do this with a DataFrame:
import java.sql.Timestamp
import org.apache.spark.sql.types.{TimestampType, IntegerType}
import org.apache.spark.sql.functions._
**Create a window for the lag function**
val w = org.apache.spark.sql.expressions.Window.partitionBy("key").orderBy("timestamp")
val df = spark.sparkContext.parallelize(List((1, 23, Timestamp.valueOf("2017-12-02 03:04:00")),
(1, 24, Timestamp.valueOf("2017-12-02 01:45:20")),
(1, 26, Timestamp.valueOf("2017-12-02 01:45:20")),
(1, 27, Timestamp.valueOf("2017-12-02 01:45:20")),
(2, 30, Timestamp.valueOf("2017-12-02 01:45:20")),
(2, 33, Timestamp.valueOf("2017-12-02 01:45:20")),
(2, 39, Timestamp.valueOf("2017-12-02 01:45:20")))).toDF("key","value","timestamp")
scala> df.printSchema
root
|-- key: integer (nullable = false)
|-- value: integer (nullable = false)
|-- timestamp: timestamp (nullable = true)
scala> val lagDF = df.withColumn("lag_value",lag("value", 1, 0).over(w))
lagDF: org.apache.spark.sql.DataFrame = [key: int, value: int ... 2 more fields]
**The previous record and the current record are now in the same row**
scala> lagDF.show(10, false)
+---+-----+-------------------+---------+
|key|value|timestamp |lag_value|
+---+-----+-------------------+---------+
|1 |24 |2017-12-02 01:45:20|0 |
|1 |26 |2017-12-02 01:45:20|24 |
|1 |27 |2017-12-02 01:45:20|26 |
|1 |23 |2017-12-02 03:04:00|27 |
|2 |30 |2017-12-02 01:45:20|0 |
|2 |33 |2017-12-02 01:45:20|30 |
|2 |39 |2017-12-02 01:45:20|33 |
+---+-----+-------------------+---------+
**Put your distance calculation logic here. I'm using a dummy operation for the demo**
scala> val result = lagDF.withColumn("dummy_operation_for_dist_calc", lagDF("value") - lagDF("lag_value"))
result: org.apache.spark.sql.DataFrame = [key: int, value: int ... 3 more fields]
scala> result.show(10, false)
+---+-----+-------------------+---------+-----------------------------+
|key|value|timestamp |lag_value|dummy_operation_for_dist_calc|
+---+-----+-------------------+---------+-----------------------------+
|1 |24 |2017-12-02 01:45:20|0 |24 |
|1 |26 |2017-12-02 01:45:20|24 |2 |
|1 |27 |2017-12-02 01:45:20|26 |1 |
|1 |23 |2017-12-02 03:04:00|27 |-4 |
|2 |30 |2017-12-02 01:45:20|0 |30 |
|2 |33 |2017-12-02 01:45:20|30 |3 |
|2 |39 |2017-12-02 01:45:20|33 |6 |
+---+-----+-------------------+---------+-----------------------------+
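Applied to your accelerometer use case, the same lag pattern gives a trapezoidal integration directly: lag the timestamp as well, compute the per-step increment, and take a running sum over the same window. A minimal sketch building on the df and w defined above (the dt, delta_v, and velocity column names are made up here; unix_timestamp converts the timestamps to seconds):

import org.apache.spark.sql.functions._

val velDF = df
  .withColumn("prev_value", lag("value", 1).over(w))
  // time difference in seconds between a record and its predecessor
  .withColumn("dt", unix_timestamp(col("timestamp"))
    - unix_timestamp(lag("timestamp", 1).over(w)))
  // one trapezoidal step: (a_prev + a_curr) / 2 * dt
  .withColumn("delta_v", (col("prev_value") + col("value")) / 2 * col("dt"))
  // running sum of the increments per key yields the velocity
  .withColumn("velocity", sum(coalesce(col("delta_v"), lit(0.0))).over(w))

The first record of each key has no predecessor, so its delta_v is null; coalesce turns that into a zero increment before the running sum.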