Iterating over an array in Scala

Time: 2017-07-06 11:10:21

Tags: arrays scala apache-spark

I have an array containing data like this:

tagid,timestamp,listner,orgid,suborgid,rssi
[4,1496745915,718,4,3,0.30]
[2,1496745915,3878,4,3,0.20]
[4,1496745918,362,4,3,0.60]
[4,1496745913,362,4,3,0.60]

I want to iterate over this array and, for each tagid and listner, find the data whose timestamp falls within 10 seconds of the latest timestamp. Here is my code:

override def inputSchema: StructType =
  StructType(StructField("time", StringType) :: StructField("tagid", StringType) ::
    StructField("listener", StringType) :: StructField("rssi", StringType) :: Nil)

override def initialize(buffer: org.apache.spark.sql.expressions.MutableAggregationBuffer): Unit = {
  buffer(0) = Array[String]()
}

override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
  buffer(0) = buffer.getAs[WrappedArray[String]](0) :+
    (input.getAs[String](0) + ";" + input.getAs[String](1) + ";" + input.getAs[String](2))
}

override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
  buffer1(0) = buffer1.getAs[WrappedArray[String]](0) ++ buffer2.getAs[WrappedArray[String]](0)
}

override def evaluate(buffer: Row): Any = {
  val in_array = buffer.getAs[WrappedArray[String]](0)
}

in_array contains all the data. I don't know how to proceed from here. Any help is much appreciated.

3 Answers:

Answer 0 (score: 1)

I see that you are trying to use a UDAF, which is a nightmare for beginners. Besides, a UDAF returns one row per group, and recovering all the original dataframe rows from the aggregated rows would be another nightmare.

I assume you have a text file with data like the following:

tagid,timestamp,listner,orgid,suborgid,rssi
4,1496745915,718,4,3,0.30
2,1496745915,3878,4,3,0.20
4,1496745918,362,4,3,0.60
4,1496745913,362,4,3,0.60

If so, reading the file into a dataframe is quite simple:

val df = sqlContext.read.format("csv").option("header", true).load("path to the above file")
df.show(false)

This should give you the dataframe:

+-----+----------+-------+-----+--------+----+
|tagid|timestamp |listner|orgid|suborgid|rssi|
+-----+----------+-------+-----+--------+----+
|4    |1496745915|718    |4    |3       |0.30|
|2    |1496745915|3878   |4    |3       |0.20|
|4    |1496745918|362    |4    |3       |0.60|
|4    |1496745913|362    |4    |3       |0.60|
+-----+----------+-------+-----+--------+----+

Now, you want to keep only the data that falls within 10 seconds of the latest timestamp for each tagid and listner. For that, use the following window specification:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first

val windowSpec = Window
                    .orderBy($"timestamp".desc)      // latest comes first
                    .partitionBy("tagid", "listner") // grouping data

You have to add the latest timestamp to every row in each of the groups created above so that you can compute the time difference. To do that, do the following:

df.withColumn("firstValue", first("timestamp") over windowSpec)

This creates a new column:

+-----+----------+-------+-----+--------+----+----------+
|tagid|timestamp |listner|orgid|suborgid|rssi|firstValue|
+-----+----------+-------+-----+--------+----+----------+
|2    |1496745915|3878   |4    |3       |0.20|1496745915|
|4    |1496745915|718    |4    |3       |0.30|1496745915|
|4    |1496745918|362    |4    |3       |0.60|1496745918|
|4    |1496745913|362    |4    |3       |0.60|1496745918|
+-----+----------+-------+-----+--------+----+----------+

The next step is simple: just check whether the time difference is less than 10 and filter:

df.filter($"firstValue".cast("long")-$"timestamp".cast("long") < 10)

Finally, drop the column that is no longer needed:

df.drop("firstValue")
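
Note that each snippet above returns a new dataframe rather than mutating df, so in practice the steps need to be chained. A minimal sketch, assuming the windowSpec defined above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first

val result = df
  .withColumn("firstValue", first("timestamp") over windowSpec)        // attach latest timestamp per group
  .filter($"firstValue".cast("long") - $"timestamp".cast("long") < 10) // keep rows within 10 seconds
  .drop("firstValue")                                                  // remove the helper column

result.show(false)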

I hope the answer is clear.

It is even clearer if you convert the timestamps to real timestamps:

+-----+-------------------+-------+-----+--------+----+-------------------+----------+
|tagid|timestamp          |listner|orgid|suborgid|rssi|firstValue         |difference|
+-----+-------------------+-------+-----+--------+----+-------------------+----------+
|2    |2017-06-06 16:30:15|3878   |4    |3       |0.20|2017-06-06 16:30:15|0         |
|4    |2017-06-06 16:30:15|718    |4    |3       |0.30|2017-06-06 16:30:15|0         |
|4    |2017-06-06 16:30:18|362    |4    |3       |0.60|2017-06-06 16:30:18|0         |
|4    |2017-06-06 16:30:13|362    |4    |3       |0.60|2017-06-06 16:30:18|5         |
+-----+-------------------+-------+-----+--------+----+-------------------+----------+
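
For reference, a minimal sketch of how this converted view could be produced (the difference column and the use of from_unixtime are my assumptions, not part of the original answer):

import org.apache.spark.sql.functions.{first, from_unixtime}

df.withColumn("firstValue", first("timestamp") over windowSpec)
  .withColumn("difference", $"firstValue".cast("long") - $"timestamp".cast("long"))
  .withColumn("timestamp", from_unixtime($"timestamp".cast("long")))   // epoch seconds -> readable time
  .withColumn("firstValue", from_unixtime($"firstValue".cast("long")))
  .show(false)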

Answer 1 (score: 0)

First of all, you are not iterating over an array. Your "array" is actually a schema, and you should define the dataframe that way (i.e. each element should be a column). If your dataframe contains an array of strings, you can use a udf to create the columns (see here).

Next, you should convert the timestamp to a timestamp type so that it can be ordered.

Finally, you can perform an argmax for each of the two columns (see here).
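
A minimal sketch of the first two steps, assuming the raw data sits in a dataframe raw with a single Array[String] column named value (raw, value, and nth are hypothetical names):

import org.apache.spark.sql.functions.{udf, lit}

// hypothetical: each "value" row is an Array[String] like
// Array("4", "1496745915", "718", "4", "3", "0.30")
val nth = udf((xs: Seq[String], i: Int) => xs(i))

val df = raw
  .withColumn("tagid",   nth($"value", lit(0)))
  .withColumn("listner", nth($"value", lit(2)))
  // cast the epoch-second string to a real timestamp type so it can be ordered
  .withColumn("timestamp", nth($"value", lit(1)).cast("long").cast("timestamp"))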

Answer 2 (score: 0)

Assuming this is your array:

val arr = Array((4,1499340495,718,4,3,0.30),
                (2,1496745915,3878,4,3,0.20),
                (4,1499340495,362,4,3,0.60),
                (4,1496745913,362,4,3,0.60))

java.time.Instant is available in Java 8:

import java.time.Instant

arr.filter( x => (Instant.now.getEpochSecond - x._2) <= 10 )
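
Note that this filter compares against the current time, which is why the sample timestamps above are close to the answer's posting date. A variant, assuming you instead want rows within 10 seconds of the latest timestamp in the array itself:

val latest = arr.map(_._2).max                      // newest timestamp in the array
val recent = arr.filter(x => latest - x._2 <= 10)   // rows within 10 seconds of it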