I am processing a CSV dataset with Spark Streaming, where I can apply batch processing using the window feature. Is there a way to do the same with Spark Structured Streaming without using aggregation functions? All the examples available on the internet use the groupBy option. I just want to split the data into batches with Structured Streaming, without any aggregation.
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, window

def foreach_batch_function(df, epoch_id):
    # df = df.select(split('value', ','))
    # df.show()
    print(type(df))
    df = df.toPandas()
    df = df.value.str.split(",", expand=True)
    print(df)  # a pandas DataFrame has no .show(); print it instead
spark = SparkSession.builder.appName("TurbineDataAnalytics").getOrCreate()
lines = spark.readStream.format("socket").option("host", "localhost").option("port", 8887).load()
# the aggregation-based approach I want to avoid; note that window() also expects a timestamp column, not the raw string value
lines = lines.groupBy(window(lines.value, "10 minutes", "5 minutes"), lines.value).count()
query = lines.writeStream.foreachBatch(foreach_batch_function).start()
query.awaitTermination()
Sample data:
Date_Time,Rt_avg,Q_avg,Rs_avg,Rm_avg,Ws_avg,Nu_avg
12/31/16 18:00,12,12.18,9.3500004,742.70001,4.5599999,700.33002
12/31/16 18:10,12,11.35,9.4799995,788.98999,4.9899998,698.03998
12/31/16 18:20,12,11.05,9.2399998,654.10999,4.8400002,700.16998
12/31/16 18:30,12,12,9.5,795.71997,4.6999998,699.37
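For reference, here is a minimal sketch of what the foreachBatch body above is trying to do, with no aggregation at all; the column names are assumed from the sample header, and the empty-batch guard and head() call are illustrative additions, not part of the original code:

def foreach_batch_function(df, epoch_id):
    # Each micro-batch arrives as a static (non-streaming) Spark DataFrame.
    pdf = df.toPandas()
    if pdf.empty:
        return  # nothing arrived in this trigger interval
    # Expand the raw CSV line into one pandas column per field.
    pdf = pdf["value"].str.split(",", expand=True)
    # Column names assumed from the sample header above.
    pdf.columns = ["Date_Time", "Rt_avg", "Q_avg", "Rs_avg", "Rm_avg", "Ws_avg", "Nu_avg"]
    print(pdf.head())  # pandas frames are printed, not .show()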
Answer 0 (score: 0)
Based on what you mentioned in the comments, you want to know how to split the dataframe's value column and how to apply a sliding window without using groupBy.
You can split the value column with the split function and apply a sliding window through a select. Take a look at the pseudocode below:
import pyspark.sql.functions as F

# readStream
lines = lines.select(lines.value)
split_col = F.split(lines.value, ',')  # the frame is named `lines`, not `df`
lines = lines.withColumn('Date_Time', split_col.getItem(0))
lines = lines.withColumn('Rt_avg', split_col.getItem(1))
lines = lines.withColumn('Q_avg', split_col.getItem(2))
lines = lines.withColumn('Rs_avg', split_col.getItem(3))
lines = lines.withColumn('Rm_avg', split_col.getItem(4))
lines = lines.withColumn('Ws_avg', split_col.getItem(5))
lines = lines.withColumn('Nu_avg', split_col.getItem(6))
# window() needs a timestamp, so cast Date_Time first (format assumed from the sample data)
lines = lines.withColumn('Date_Time', F.to_timestamp('Date_Time', 'MM/dd/yy HH:mm'))
w = lines.select(F.window('Date_Time', '5 seconds'))
# writeStream
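Putting the pieces together, here is a minimal end-to-end sketch without any groupBy. The socket host/port are carried over from the question, and the 'MM/dd/yy HH:mm' timestamp format is an assumption based on the sample rows:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TurbineDataAnalytics").getOrCreate()

# Socket source, as in the question.
lines = spark.readStream.format("socket").option("host", "localhost").option("port", 8887).load()

# Split the raw CSV line into named columns; no aggregation involved.
split_col = F.split(lines.value, ",")
names = ["Date_Time", "Rt_avg", "Q_avg", "Rs_avg", "Rm_avg", "Ws_avg", "Nu_avg"]
for i, name in enumerate(names):
    lines = lines.withColumn(name, split_col.getItem(i))

# window() needs a timestamp column; format assumed from the sample rows.
lines = lines.withColumn("Date_Time", F.to_timestamp("Date_Time", "MM/dd/yy HH:mm"))
windowed = lines.select(F.window("Date_Time", "5 seconds"), *names[1:])

# Hand each micro-batch to ordinary, non-aggregating code.
query = windowed.writeStream.foreachBatch(lambda df, epoch_id: df.show(truncate=False)).start()
query.awaitTermination()

If fixed-size micro-batches are all that is needed, the window() select can be dropped entirely and the batch boundaries controlled with a processing-time trigger, e.g. writeStream.trigger(processingTime="10 seconds").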