The use case is to capture the time difference between streaming sensor entries that share the same station and part, compare that difference against a tolerance, and potentially trigger an alert when it is out of range. I currently parse the fields into a DataFrame and register it as a table so I can run a SQL query that uses the LAG function.
# Split each raw "station|part|timestamp" record into a (station, part, event) tuple.
events = rawFilter.map(lambda x: x.split("|")).map(lambda x: (x[0], x[1], x[2]))

eventSchema = StructType([
    StructField("station", StringType(), False),
    StructField("part", StringType(), False),
    StructField("event", TimestampType(), False)])

# Build a DataFrame from the parsed tuples and expose it to SQL as "events_table".
eventDF = sqlContext.createDataFrame(events, eventSchema)
eventDF.registerTempTable("events_table")
%sql select station, part, event, prev_event,
  cast(event as double) - cast(prev_event as double) as CycleTime
from (select station, part, event,
        LAG(event) over (partition by station, part order by event) as prev_event
      from events_table) x limit 10
Example Streaming Sensor Data:
station1|part1|<timestamp>
station2|part2|<timestamp>
station3|part3|<timestamp>
station1|part1|<timestamp>
station1|part1|<timestamp>
station1|part1|<timestamp>
station3|part3|<timestamp>
station1|part1|<timestamp>
What I would like to understand is how to accomplish the window function on the DataFrame itself, so that the resulting table already contains the computed time difference.
Part 2 of this question is understanding how to handle a change of part. In that case the CycleTime should not be calculated, or should be stopped; however, the time difference between two different parts at the same station is a separate calculation called ChangeOver. I don't see how to accomplish this with Spark Streaming, because the window could stretch over several days before the part changes, so I am considering pushing the data into HBase or something similar to compute the ChangeOver.
Answer:
Window definitions on DataFrames strictly follow the SQL conventions, using the partitionBy, orderBy, rangeBetween and rowsBetween methods, which correspond to the equivalent SQL clauses.
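As a small, hedged illustration of the frame methods (the data and the moving_avg column are invented for this example, and it assumes the sqlContext from the question), a frame can be restricted to the current row and the two preceding rows with rowsBetween:

from pyspark.sql.functions import avg, col
from pyspark.sql.window import Window

df = sqlContext.createDataFrame(
    [("station1", 1.0), ("station1", 2.0), ("station1", 6.0)],
    ["station", "value"])

# Frame covering the two preceding rows and the current row.
w = Window.partitionBy("station").orderBy("value").rowsBetween(-2, 0)

df.withColumn("moving_avg", avg(col("value")).over(w)).show()

Applied to the question's data, the cycle time can then be computed directly on the DataFrame: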
from pyspark.sql.functions import col, lag, unix_timestamp
from pyspark.sql.window import Window

rawDF = sc.parallelize([
    ("station1", "part1", "2015-01-03 00:11:02"),
    ("station2", "part2", "2015-02-01 10:20:10"),
    ("station3", "part3", "2015-03-02 00:30:00"),
    ("station1", "part1", "2015-05-01 01:07:00"),
    ("station1", "part1", "2015-01-13 05:16:10"),
    ("station1", "part1", "2015-11-20 10:22:40"),
    ("station3", "part3", "2015-09-04 03:15:22"),
    ("station1", "part1", "2015-03-05 00:41:33")
]).toDF(["station", "part", "event"])

# Convert the event string to seconds since the epoch so differences are plain subtraction.
eventDF = rawDF.withColumn("event", unix_timestamp(col("event")))

# Window over each station, ordered by event time.
w = Window.partitionBy(col("station")).orderBy(col("event"))

(eventDF
    .withColumn("prev_event", lag(col("event")).over(w))
    .withColumn("cycle_time", col("event") - col("prev_event")))