Grouping with an average function in Scala

Asked: 2017-07-06 09:22:13

Tags: scala apache-spark apache-spark-sql

Hi, I am completely new to Spark with Scala. I need an idea or a sample solution. I have data like this:

tagid,timestamp,listner,orgid,suborgid,rssi
[4,1496745915,718,4,3,0.30]
[2,1496745915,3878,4,3,0.20]
[4,1496745918,362,4,3,0.60]
[4,1496745913,362,4,3,0.60]
[2,1496745918,362,4,3,0.10]
[3,1496745912,718,4,3,0.05]
[2,1496745918,718,4,3,0.30]
[4,1496745911,1901,4,3,0.60]
[4,1496745912,718,4,3,0.60]
[2,1496745915,362,4,3,0.30]
[2,1496745912,3878,4,3,0.20]
[2,1496745915,1901,4,3,0.30]
[2,1496745910,1901,4,3,0.30]

For each tagid and each listner, I want to find the data from the last 10 seconds by timestamp, and then compute the average of the rssi values over that 10-second window, like this:

2,1496745918,718,4,3,0.60
2,1496745917,718,4,3,1.30
2,1496745916,718,4,1,2.20
2,1496745914,718,1,2,3.10
2,1496745911,718,1,2,6.10
4,1496745910,1901,1,2,0.30
4,1496745908,1901,1,2,1.30
..........................
..........................

That is the result I need to produce. Any solution or suggestion is appreciated. Note: I am using Spark with Scala.

I tried a Spark SQL query, but it did not work properly:

val filteravg = avg.registerTempTable("avg")
val avgfinal = sqlContext.sql("SELECT tagid,timestamp,listner FROM (SELECT tagid,timestamp,listner,dense_rank() OVER (PARTITION BY _c6 ORDER BY _c5 ASC) as rank FROM avg) tmp WHERE rank <= 10")
avgfinal.collect.foreach(println)

I am also trying it with arrays. Any help would be appreciated.
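For reference, the grouping described above — the latest 10 seconds per (tagid, listner), followed by an average of rssi — can be sketched in plain Scala collections before bringing Spark into it. This is a minimal sketch; the `Reading` case class is an assumption, as is the interpretation that "last 10 seconds" means rows within 10 seconds of each group's newest timestamp:

```scala
// Minimal sketch of the intended logic in plain Scala (no Spark).
// Assumption: "last 10 seconds" = rows whose timestamp lies within
// 10 seconds of the newest timestamp in each (tagid, listner) group.
case class Reading(tagid: Int, timestamp: Long, listner: Int,
                   orgid: Int, suborgid: Int, rssi: Double)

val readings = Seq(
  Reading(4, 1496745915L, 718, 4, 3, 0.30),
  Reading(2, 1496745915L, 3878, 4, 3, 0.20),
  Reading(4, 1496745918L, 362, 4, 3, 0.60),
  Reading(4, 1496745913L, 362, 4, 3, 0.60),
  Reading(2, 1496745918L, 362, 4, 3, 0.10),
  Reading(3, 1496745912L, 718, 4, 3, 0.05),
  Reading(2, 1496745918L, 718, 4, 3, 0.30),
  Reading(4, 1496745911L, 1901, 4, 3, 0.60),
  Reading(4, 1496745912L, 718, 4, 3, 0.60),
  Reading(2, 1496745915L, 362, 4, 3, 0.30),
  Reading(2, 1496745912L, 3878, 4, 3, 0.20),
  Reading(2, 1496745915L, 1901, 4, 3, 0.30),
  Reading(2, 1496745910L, 1901, 4, 3, 0.30)
)

// Average rssi over the last 10 seconds per (tagid, listner) group.
val averages: Map[(Int, Int), Double] =
  readings.groupBy(r => (r.tagid, r.listner)).map { case (key, rows) =>
    val latest = rows.map(_.timestamp).max
    val recent = rows.filter(r => latest - r.timestamp < 10)
    key -> recent.map(_.rssi).sum / recent.size
  }
```

Once the per-group cutoff and average are pinned down like this, the same logic maps onto Spark's `Window.partitionBy`.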

1 Answer:

Answer 0 (score: 3)

If you already have your dataframe as

+-----+----------+-------+-----+--------+----+
|tagid|timestamp |listner|orgid|suborgid|rssi|
+-----+----------+-------+-----+--------+----+
|4    |1496745915|718    |4    |3       |0.30|
|2    |1496745915|3878   |4    |3       |0.20|
|4    |1496745918|362    |4    |3       |0.60|
|4    |1496745913|362    |4    |3       |0.60|
|2    |1496745918|362    |4    |3       |0.10|
|3    |1496745912|718    |4    |3       |0.05|
|2    |1496745918|718    |4    |3       |0.30|
|4    |1496745911|1901   |4    |3       |0.60|
|4    |1496745912|718    |4    |3       |0.60|
|2    |1496745915|362    |4    |3       |0.30|
|2    |1496745912|3878   |4    |3       |0.20|
|2    |1496745915|1901   |4    |3       |0.30|
|2    |1496745910|1901   |4    |3       |0.30|
+-----+----------+-------+-----+--------+----+

then doing the following should work for you:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, first}

// per tagid: newest timestamp, rows within 10 s of it, then the average rssi
df.withColumn("firstValue", first("timestamp") over Window.partitionBy("tagid").orderBy($"timestamp".desc))
  .filter($"firstValue".cast("long") - $"timestamp".cast("long") < 10)
  .withColumn("average", avg("rssi") over Window.partitionBy("tagid"))
  .drop("firstValue")
  .show(false)

and you should get the following output:

+-----+----------+-------+-----+--------+----+-------------------+
|tagid|timestamp |listner|orgid|suborgid|rssi|average            |
+-----+----------+-------+-----+--------+----+-------------------+
|3    |1496745912|718    |4    |3       |0.05|0.05               |
|4    |1496745918|362    |4    |3       |0.60|0.54               |
|4    |1496745915|718    |4    |3       |0.30|0.54               |
|4    |1496745913|362    |4    |3       |0.60|0.54               |
|4    |1496745912|718    |4    |3       |0.60|0.54               |
|4    |1496745911|1901   |4    |3       |0.60|0.54               |
|2    |1496745918|362    |4    |3       |0.10|0.24285714285714288|
|2    |1496745918|718    |4    |3       |0.30|0.24285714285714288|
|2    |1496745915|3878   |4    |3       |0.20|0.24285714285714288|
|2    |1496745915|362    |4    |3       |0.30|0.24285714285714288|
|2    |1496745915|1901   |4    |3       |0.30|0.24285714285714288|
|2    |1496745912|3878   |4    |3       |0.20|0.24285714285714288|
|2    |1496745910|1901   |4    |3       |0.30|0.24285714285714288|
+-----+----------+-------+-----+--------+----+-------------------+
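The "average" column above (0.54 for tag 4, 0.05 for tag 3, 0.24285714285714288 for tag 2) can be sanity-checked without Spark, since every row in this sample falls within 10 seconds of its tag's newest timestamp. A quick sketch; the rssi sequences below simply restate the values from the table:

```scala
// Sanity check of the answer's "average" column: all timestamps here
// survive the 10-second filter, so the average is just the plain mean
// of each tag's rssi values.
val byTag: Map[Int, Seq[Double]] = Map(
  4 -> Seq(0.30, 0.60, 0.60, 0.60, 0.60),
  3 -> Seq(0.05),
  2 -> Seq(0.20, 0.10, 0.30, 0.30, 0.20, 0.30, 0.30)
)

val tagAverages: Map[Int, Double] =
  byTag.map { case (tag, rssi) => tag -> rssi.sum / rssi.size }
```

Note that this window is partitioned by tagid only; if you need the average per tagid and per listner as the question asks, partition both windows by `"tagid", "listner"` instead.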