Spark Structured Streaming - Is it possible to use the window function in Spark Structured Streaming without aggregation?

Time: 2019-06-08 21:53:03

Tags: apache-spark pyspark spark-structured-streaming

I am working with a CSV dataset and processing it with Spark Streaming. With Spark Streaming I can apply windowed batching. Is there a way to do the same with Spark Structured Streaming without using an aggregation function? All the examples available on the internet use the groupBy option. I just want to split the data into batches with Structured Streaming, without any aggregation.

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, window

def foreach_batch_function(df, epoch_id):
    #df = df.select(split('value',','))
    #df.show()
    print(type(df))
    # each micro-batch arrives here as a plain (non-streaming) DataFrame
    df = df.toPandas()
    df = df.value.str.split(",", expand=True)
    print(df)  # pandas DataFrames have no .show(); print the frame instead

spark = SparkSession.builder.appName("TurbineDataAnalytics").getOrCreate()

lines = spark.readStream.format("socket").option("host", "localhost").option("port", 8887).load()

# the groupBy-based approach I want to avoid; note also that window() expects a timestamp column, not the raw string in lines.value
lines = lines.groupBy(window(lines.value, "10 minutes", "5 minutes"), lines.value).count()

query = lines.writeStream.foreachBatch(foreach_batch_function).start()

query.awaitTermination()

Sample data:

Date_Time,Rt_avg,Q_avg,Rs_avg,Rm_avg,Ws_avg,Nu_avg
12/31/16 18:00,12,12.18,9.3500004,742.70001,4.5599999,700.33002
12/31/16 18:10,12,11.35,9.4799995,788.98999,4.9899998,698.03998
12/31/16 18:20,12,11.05,9.2399998,654.10999,4.8400002,700.16998
12/31/16 18:30,12,12,9.5,795.71997,4.6999998,699.37

1 Answer:

Answer 0 (score: 0):

Based on what you mentioned in the comments, you want to know how to split the value column of the DataFrame and how to apply a sliding window without using groupBy.

You can split the value column with the split function and apply the sliding window via a select. Take a look at the pseudocode below:

import pyspark.sql.functions as F
#readStream
lines = lines.select(lines.value)

split_col = F.split(lines.value, ',')

# window() requires a TimestampType column, so parse Date_Time while extracting it
lines = lines.withColumn('Date_Time', F.to_timestamp(split_col.getItem(0), 'MM/dd/yy HH:mm'))
lines = lines.withColumn('Rt_avg', split_col.getItem(1))
lines = lines.withColumn('Q_avg', split_col.getItem(2))
lines = lines.withColumn('Rs_avg', split_col.getItem(3))
lines = lines.withColumn('Rm_avg', split_col.getItem(4))
lines = lines.withColumn('Ws_avg', split_col.getItem(5))
lines = lines.withColumn('Nu_avg', split_col.getItem(6))

# applying window() inside a select, with no groupBy/aggregation
w = lines.select(F.window("Date_Time", "5 seconds"))
#writeStream
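
For completeness, here is a minimal end-to-end sketch that puts those pieces together. It reuses the socket source, window sizes, and foreachBatch sink from the question; the timestamp format 'MM/dd/yy HH:mm' and the double casts are assumptions based on the sample data:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

def foreach_batch_function(df, epoch_id):
    # each micro-batch arrives here as a plain (non-streaming) DataFrame
    df.show(truncate=False)

spark = SparkSession.builder.appName("TurbineDataAnalytics").getOrCreate()

lines = spark.readStream.format("socket").option("host", "localhost").option("port", 8887).load()

split_col = F.split(lines.value, ',')

parsed = (lines
    .withColumn('Date_Time', F.to_timestamp(split_col.getItem(0), 'MM/dd/yy HH:mm'))
    .withColumn('Rt_avg', split_col.getItem(1).cast('double'))
    .withColumn('Q_avg', split_col.getItem(2).cast('double'))
    .withColumn('Rs_avg', split_col.getItem(3).cast('double'))
    .withColumn('Rm_avg', split_col.getItem(4).cast('double'))
    .withColumn('Ws_avg', split_col.getItem(5).cast('double'))
    .withColumn('Nu_avg', split_col.getItem(6).cast('double'))
    .drop('value'))

# attach a sliding-window struct column without any groupBy/aggregation;
# with a slide shorter than the window, a row can fall into several overlapping windows
windowed = parsed.withColumn('window', F.window('Date_Time', '10 minutes', '5 minutes'))

query = windowed.writeStream.foreachBatch(foreach_batch_function).start()
query.awaitTermination()

Because nothing is aggregated, no watermark is required; downstream code can read window.start and window.end from the struct column to group rows into batches however it likes.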