I have an array containing data like this:
tagid,timestamp,listner,orgid,suborgid,rssi
[4,1496745915,718,4,3,0.30]
[2,1496745915,3878,4,3,0.20]
[4,1496745918,362,4,3,0.60]
[4,1496745913,362,4,3,0.60]
I want to iterate over this array and, for every tag & listner, find the data whose timestamps fall within the latest 10 seconds. Here is my code:
// (inside a UserDefinedAggregateFunction)
override def inputSchema: StructType =
  StructType(StructField("time", StringType) :: StructField("tagid", StringType) ::
    StructField("listener", StringType) :: StructField("rssi", StringType) :: Nil)

override def initialize(buffer: org.apache.spark.sql.expressions.MutableAggregationBuffer): Unit = {
  buffer(0) = Array[String]()
}

override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
  // append "time;tagid;listener" for each incoming row
  buffer(0) = buffer.getAs[WrappedArray[String]](0) :+
    (input.getAs[String](0) + ";" + input.getAs[String](1) + ";" + input.getAs[String](2))
}

override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
  buffer1(0) = buffer1.getAs[WrappedArray[String]](0) ++ buffer2.getAs[WrappedArray[String]](0)
}

override def evaluate(buffer: Row): Any = {
  val in_array = buffer.getAs[WrappedArray[String]](0)
}
in_array contains all the data, but I don't know how to proceed from here. Thanks a lot.
Answer 0 (score: 1)
I see that you are trying to use a udaf, which is a nightmare for beginners. Besides, a udaf returns one row per group, so getting all the original rows of the dataframe back from the aggregated result would be another nightmare.
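To illustrate (this sketch is mine, not part of the original answer, and uses the dataframe df built below): a group-wise aggregation collapses each group to a single row, so the per-row detail has to be unpacked again afterwards.
import org.apache.spark.sql.functions._

// illustration only: any group-wise aggregation yields one row per (tagid, listner) group
df.groupBy("tagid", "listner")
  .agg(collect_list("timestamp").as("timestamps"))
  .show(false)
// recovering one row per original record would mean exploding the collected array again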
I assume you have a text file with data like this:
tagid,timestamp,listner,orgid,suborgid,rssi
4,1496745915,718,4,3,0.30
2,1496745915,3878,4,3,0.20
4,1496745918,362,4,3,0.60
4,1496745913,362,4,3,0.60
If so, reading the file into a dataframe is quite simple:
val df = sqlContext.read.format("csv").option("header", true).load("path to the above file")
df.show(false)
This should give you the following dataframe:
+-----+----------+-------+-----+--------+----+
|tagid|timestamp |listner|orgid|suborgid|rssi|
+-----+----------+-------+-----+--------+----+
|4 |1496745915|718 |4 |3 |0.30|
|2 |1496745915|3878 |4 |3 |0.20|
|4 |1496745918|362 |4 |3 |0.60|
|4 |1496745913|362 |4 |3 |0.60|
+-----+----------+-------+-----+--------+----+
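Side note (my addition, not in the original answer): with just the header option every column is read as a string, which is why the casts to long appear further down; optionally you can let Spark infer the types instead:
// optional: let Spark infer column types so timestamp and rssi come back as numbers
val dfTyped = sqlContext.read
  .format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("path to the above file")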
Now you want to keep only the data that falls within 10 seconds of the latest timestamp for each tagid and listner. For this, use the following window specification:
import sqlContext.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val windowSpec = Window
  .orderBy($"timestamp".desc)         // latest to come first
  .partitionBy("tagid", "listner")    // grouping data
You need to attach the latest timestamp of each group to every row in that group, so that you can compute the time difference. For that, do the following:
val withFirst = df.withColumn("firstValue", first("timestamp") over windowSpec)
This creates a new column:
+-----+----------+-------+-----+--------+----+----------+
|tagid|timestamp |listner|orgid|suborgid|rssi|firstValue|
+-----+----------+-------+-----+--------+----+----------+
|2 |1496745915|3878 |4 |3 |0.20|1496745915|
|4 |1496745915|718 |4 |3 |0.30|1496745915|
|4 |1496745918|362 |4 |3 |0.60|1496745918|
|4 |1496745913|362 |4 |3 |0.60|1496745918|
+-----+----------+-------+-----+--------+----+----------+
The next step is simple: just check whether the time difference is less than 10 seconds and filter:
val filtered = withFirst.filter($"firstValue".cast("long") - $"timestamp".cast("long") < 10)
Finally, drop the helper column, which is no longer needed:
filtered.drop("firstValue")
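If you prefer, the three steps above can also be chained into a single expression (a sketch; result is a name I introduce):
val result = df
  .withColumn("firstValue", first("timestamp") over windowSpec)         // latest timestamp per (tagid, listner)
  .filter($"firstValue".cast("long") - $"timestamp".cast("long") < 10)  // keep rows within 10 seconds of it
  .drop("firstValue")                                                   // remove the helper column
result.show(false)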
I hope the answer is clear and helpful.
It is even clearer if you convert the timestamps into real timestamps:
+-----+-------------------+-------+-----+--------+----+-------------------+---------+
|tagid|timestamp |listner|orgid|suborgid|rssi|firstValue |differnce|
+-----+-------------------+-------+-----+--------+----+-------------------+---------+
|2 |2017-06-06 16:30:15|3878 |4 |3 |0.20|2017-06-06 16:30:15|0 |
|4 |2017-06-06 16:30:15|718 |4 |3 |0.30|2017-06-06 16:30:15|0 |
|4 |2017-06-06 16:30:18|362 |4 |3 |0.60|2017-06-06 16:30:18|0 |
|4 |2017-06-06 16:30:13|362 |4 |3 |0.60|2017-06-06 16:30:18|5 |
+-----+-------------------+-------+-----+--------+----+-------------------+---------+
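The conversion itself is not shown above; one way to produce that table (a sketch, assuming the timestamps are epoch seconds) is:
// assumption: timestamp holds epoch seconds; from_unixtime renders it as a readable timestamp
df.withColumn("firstValue", first("timestamp") over windowSpec)
  .withColumn("differnce", $"firstValue".cast("long") - $"timestamp".cast("long"))  // column name spelled as in the table
  .withColumn("timestamp", from_unixtime($"timestamp"))
  .withColumn("firstValue", from_unixtime($"firstValue"))
  .show(false)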
Answer 1 (score: 0)
First of all, you are not iterating over an array. Your "array" is actually a schema, and you should define your dataframe that way (i.e. each element should be a column). If your dataframe contains an array of strings, you can use a udf to create the columns (see here).
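A hedged sketch of that idea (assuming the raw lines sit in a single string column named value of a dataframe rawDf; the built-in split function is used here instead of a udf for brevity):
import org.apache.spark.sql.functions._

// hypothetical: each row of rawDf holds one CSV string such as "4,1496745915,718,4,3,0.30" in column "value"
val parts = split($"value", ",")
val dfCols = rawDf.select(
  parts(0).as("tagid"),
  parts(1).as("timestamp"),
  parts(2).as("listner"),
  parts(3).as("orgid"),
  parts(4).as("suborgid"),
  parts(5).as("rssi"))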
Next, you should convert the timestamp into a timestamp type so that it can be ordered.
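For instance (a sketch, assuming the column contains epoch seconds as strings):
// cast string -> long -> timestamp so the column can be ordered chronologically
val dfTs = dfCols.withColumn("ts", $"timestamp".cast("long").cast("timestamp"))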
Finally, you can do an argmax for each of the two columns (see here).
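One common way to express such an argmax on Spark 2.x (a sketch, not the original answer's code) is to take, per group, the max of a struct whose first field is the ordering column:
import org.apache.spark.sql.functions._

// argmax over ts per (tagid, listner): structs compare field by field, so ts decides the maximum
val latest = dfTs
  .groupBy("tagid", "listner")
  .agg(max(struct($"ts", $"rssi")).as("latest"))
  .select($"tagid", $"listner", $"latest.ts", $"latest.rssi")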
Answer 2 (score: 0)
Suppose this is your array:
val arr = Array((4,1499340495,718,4,3,0.30),
(2,1496745915,3878,4,3,0.20),
(4,1499340495,362,4,3,0.60),
(4,1496745913,362,4,3,0.60))
java.time.Instant is available since Java 8.
import java.time.Instant
arr.filter( x => (Instant.now.getEpochSecond - x._2) <= 10 )
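Note that this compares against the current wall-clock time; if you want the entries within 10 seconds of each tag's own latest timestamp instead, a plain-collections sketch (my variation, not the original answer) could look like this:
// group by tagid, then keep only entries within 10 seconds of that tag's newest timestamp
val recentPerTag = arr
  .groupBy(_._1)                        // key: tagid
  .map { case (tag, rows) =>
    val maxTs = rows.map(_._2).max      // newest timestamp for this tag
    tag -> rows.filter(row => maxTs - row._2 <= 10)
  }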