Spark: Python window functions for DataFrames

Date: 2015-11-25 16:25:08

Tags: python sql apache-spark apache-spark-sql spark-streaming

The use case is to capture the time difference between streaming sensor entries where the station and part are the same, compare that difference against a tolerance, and potentially trigger an alert if it is out of range. I am currently parsing the fields into a DataFrame and registering it as a table so I can run a SQL query using the LAG function.

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Split each "station|part|timestamp" record into a (station, part, event) tuple
events = rawFilter.map(lambda x: x.split("|")).map(lambda x: (x[0], x[1], x[2]))

eventSchema = StructType(
    [StructField("station", StringType(), False),
     StructField("part", StringType(), False),
     StructField("event", TimestampType(), False)])

eventDF = sqlContext.createDataFrame(events, eventSchema)
eventDF.registerTempTable("events_table")

%sql
SELECT station, part, event, prev_event,
       CAST(event AS double) - CAST(prev_event AS double) AS CycleTime
FROM (SELECT station, part, event,
             LAG(event) OVER (PARTITION BY station, part ORDER BY event) AS prev_event
      FROM events_table) x
LIMIT 10

Example Streaming Sensor Data:
station1|part1|<timestamp>
station2|part2|<timestamp>
station3|part3|<timestamp>
station1|part1|<timestamp>
station1|part1|<timestamp>
station1|part1|<timestamp>
station3|part3|<timestamp>
station1|part1|<timestamp>

What I would like to understand is how to perform the window function on the DataFrame itself, so that the resulting table already has the time difference calculated.

Part 2 of this question is understanding how to handle a change of part. In that case the CycleTime should not be calculated, or should stop; however, the time difference between two different parts at the same station is a separate calculation called ChangeOver. I don't see how to accomplish this with Spark Streaming, since the window could stretch over several days before the part changes, so I am considering pushing the data into HBase or something similar to calculate ChangeOver.
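For reference, one possible way to express both measures in a single LAG-based query is to partition by station only and compare the previous part with the current one. This is a sketch, not something from the original post; the CASE logic and the prev_part/ChangeOver names are illustrative assumptions:

%sql
SELECT station, part, event,
       -- CycleTime only when the part is unchanged (assumed logic)
       CASE WHEN part = prev_part
            THEN CAST(event AS double) - CAST(prev_event AS double) END AS CycleTime,
       -- ChangeOver when the part differs from the previous one at this station
       CASE WHEN part <> prev_part
            THEN CAST(event AS double) - CAST(prev_event AS double) END AS ChangeOver
FROM (SELECT station, part, event,
             LAG(event) OVER (PARTITION BY station ORDER BY event) AS prev_event,
             LAG(part)  OVER (PARTITION BY station ORDER BY event) AS prev_part
      FROM events_table) x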

1 answer:

Answer 0 (score: 0)

Window definitions on DataFrames closely follow the SQL conventions, with the partitionBy, orderBy, rangeBetween and rowsBetween methods corresponding to the equivalent SQL clauses.

from pyspark.sql.functions import col, lag, unix_timestamp
from pyspark.sql.window import Window

rawDF = sc.parallelize([
    ("station1", "part1", "2015-01-03 00:11:02"),
    ("station2", "part2", "2015-02-01 10:20:10"),
    ("station3", "part3", "2015-03-02 00:30:00"),
    ("station1", "part1", "2015-05-01 01:07:00"),
    ("station1", "part1", "2015-01-13 05:16:10"),
    ("station1", "part1", "2015-11-20 10:22:40"),
    ("station3", "part3", "2015-09-04 03:15:22"),
    ("station1", "part1", "2015-03-05 00:41:33")
]).toDF(["station", "part", "event"])

# Convert the timestamp string to seconds since the epoch
eventDF = rawDF.withColumn("event", unix_timestamp(col("event")))

# Window over rows for the same station, ordered by event time
w = Window.partitionBy(col("station")).orderBy(col("event"))

(eventDF
  .withColumn("prev_event", lag(col("event")).over(w))
  .withColumn("cycle_time", col("event") - col("prev_event")))