How to map adjacent elements in Scala

Date: 2017-03-01 11:41:27

Tags: scala apache-spark

I have an RDD[String] in the format device, timestamp, on/off. How do I calculate how long each device was turned on? What is the best way to do this in Spark?

Here on is represented by 1 and off by 0.

E.g.:

A,1335952933,1
A,1335953754,0
A,1335994294,1
A,1335995228,0
B,1336001513,1
B,1336002622,0
B,1336006905,1
B,1336007462,0

Intermediate step 1:

A,((1335953754 - 1335952933),(1335995228 - 1335994294))
B,((1336002622 - 1336001513),(1336007462 - 1336006905))

Intermediate step 2:

(A,(821,934))
(B,(1109,557))

Output:

(A,1755)
(B,1666)
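
For reference, the intermediate steps above can also be expressed with plain RDD operations. A minimal sketch, assuming the input RDD[String] is called rdd and that each device's events alternate strictly on/off:

val totals = rdd
  .map { line =>
    val Array(id, ts, flag) = line.split(",")
    (id, (ts.toLong, flag.toInt))
  }
  .groupByKey() // fine here; each device's event list must fit in memory
  .mapValues { events =>
    events.toSeq.sortBy(_._1) // order by timestamp
      .sliding(2)             // adjacent pairs of events
      .collect { case Seq((t1, 1), (t2, 0)) => t2 - t1 } // keep on -> off
      .sum
  }
// totals.collect() => Array((A,1755), (B,1666))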

1 Answer:

Answer 0 (score: 2):

I'm assuming the RDD[String] can be parsed into an RDD of DeviceLog, where DeviceLog is:

case class DeviceLog(id: String, timestamp: Long, onoff: Int)

DeviceLog is a simple class.
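
For completeness, a minimal sketch of such a parse (the parseLine helper name is mine, and it assumes clean comma-separated input):

def parseLine(line: String): DeviceLog = {
  val Array(id, ts, onoff) = line.split(",")
  DeviceLog(id, ts.trim.toLong, onoff.trim.toInt)
}

// val deviceLogs = rdd.map(parseLine) // rdd: the original RDD[String]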

// initialize contexts (conf is an existing SparkConf)
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)

Initialize the Spark context and the SQL context that we will use for DataFrames.
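
As a side note, on Spark 2.x the same setup would typically go through SparkSession instead of HiveContext. A minimal sketch (the app name is arbitrary):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("device-uptime").getOrCreate()
import spark.implicits._ // same role as sqlContext.implicits._ used below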

Step 1: create a DataFrame from the input

import sqlContext.implicits._ // enables .toDF() and the $"col" syntax

val input = List(
  DeviceLog("A", 1335952933, 1),
  DeviceLog("A", 1335953754, 0),
  DeviceLog("A", 1335994294, 1),
  DeviceLog("A", 1335995228, 0),
  DeviceLog("B", 1336001513, 1),
  DeviceLog("B", 1336002622, 0),
  DeviceLog("B", 1336006905, 1),
  DeviceLog("B", 1336007462, 0))

val df = input.toDF()
df.show()
+---+----------+-----+
| id| timestamp|onoff|
+---+----------+-----+
|  A|1335952933|    1|
|  A|1335953754|    0|
|  A|1335994294|    1|
|  A|1335995228|    0|
|  B|1336001513|    1|
|  B|1336002622|    0|
|  B|1336006905|    1|
|  B|1336007462|    0|
+---+----------+-----+

Step 2: partition by device id, order by timestamp, and keep the pairing information (on/off)

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val wSpec = Window.partitionBy("id").orderBy("timestamp")

val df1 = df
  .withColumn("spend", lag("timestamp", 1).over(wSpec)) // previous timestamp
  .withColumn("one", lag("onoff", 1).over(wSpec))       // previous on/off flag
  .where($"spend".isNotNull) // drop each device's first row (no predecessor)
df1.show()

+---+----------+-----+----------+---+
| id| timestamp|onoff|     spend|one|
+---+----------+-----+----------+---+
|  A|1335953754|    0|1335952933|  1|
|  A|1335994294|    1|1335953754|  0|
|  A|1335995228|    0|1335994294|  1|
|  B|1336002622|    0|1336001513|  1|
|  B|1336006905|    1|1336002622|  0|
|  B|1336007462|    0|1336006905|  1|
+---+----------+-----+----------+---+

Step 3: compute upTime and filter by the criteria (one - onoff equals 1 only for an on-to-off transition)

val df2 = df1
  .withColumn("upTime", $"timestamp" - $"spend") // duration of each interval
  .withColumn("criteria", $"one" - $"onoff")     // 1 only for on -> off pairs
  .where($"criteria" === 1)
df2.show()

+---+----------+-----+----------+---+------+--------+
| id| timestamp|onoff|     spend|one|upTime|criteria|
+---+----------+-----+----------+---+------+--------+
|  A|1335953754|    0|1335952933|  1|   821|       1|
|  A|1335995228|    0|1335994294|  1|   934|       1|
|  B|1336002622|    0|1336001513|  1|  1109|       1|
|  B|1336007462|    0|1336006905|  1|   557|       1|
+---+----------+-----+----------+---+------+--------+

Step 4: group by id and sum upTime

import org.apache.spark.sql.functions.sum

val df3 = df2.groupBy($"id").agg(sum("upTime"))
df3.show()

+---+-----------+
| id|sum(upTime)|
+---+-----------+
|  A|       1755|
|  B|       1666|
+---+-----------+
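
If you need the totals back as (id, upTime) pairs rather than a DataFrame, a minimal conversion sketch:

val result = df3.rdd.map(row => (row.getString(0), row.getLong(1)))
result.collect().foreach(println) // (A,1755), (B,1666)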