I'm trying to write an incremental learning algorithm (e.g. sklearn's SGDClassifier) by streaming data from a set of CSV files read with pyspark. How can I feed the data obtained via spark.readStream into a downstream machine learning algorithm?
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType
fieldnames = ["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "y"]
fieldtypes = [StructField(i, FloatType(), True) for i in fieldnames]
schema = StructType(fieldtypes)
spark = SparkSession.builder.appName('streamcsv').getOrCreate()
agr_data = (spark.readStream.format("csv")
            .schema(schema)
            .option("header", False)
            .option("maxFilesPerTrigger", 1)  # one file per micro-batch
            .load("../datasets/spark"))
# access agr_data as dictionary or pandas df
# run sklearn partial_fit on that data
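One pattern for the two commented steps is to put the sklearn side into a helper that consumes a pandas DataFrame, then drive that helper from the stream with foreachBatch, which hands each micro-batch to a Python function as a static DataFrame. This is a sketch: the binary label set and the helper names are my assumptions, not from the question, and the stream wiring at the end assumes the agr_data stream defined above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

FEATURES = ["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]

model = SGDClassifier()
CLASSES = np.array([0.0, 1.0])  # assumed label set; partial_fit needs it up front

def train_on_pandas(pdf: pd.DataFrame) -> None:
    """Incrementally fit the model on one micro-batch converted to pandas."""
    if pdf.empty:
        return
    model.partial_fit(pdf[FEATURES].to_numpy(), pdf["y"].to_numpy(),
                      classes=CLASSES)

def train_on_batch(batch_df, batch_id):
    # batch_df is a *static* DataFrame for this micro-batch, so toPandas() is safe
    train_on_pandas(batch_df.toPandas())

# Wiring into the stream defined above (uncomment where agr_data exists):
# query = agr_data.writeStream.foreachBatch(train_on_batch).start()
# query.awaitTermination()
```

Keeping the training logic in train_on_pandas (rather than inline in the foreachBatch callback) makes it testable without a running Spark cluster.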
I tried the following code:
query = agr_data \
    .writeStream \
    .queryName("aggregates") \
    .outputMode("append") \
    .format("memory") \
    .start()
spark.sql("select * from aggregates").show()
but I get this (empty) output:
+---+---+---+---+---+---+---+---+---+---+
| x0| x1| x2| x3| x4| x5| x6| x7| x8| y|
+---+---+---+---+---+---+---+---+---+---+
+---+---+---+---+---+---+---+---+---+---+