I am working through Spark: The Definitive Guide and trying out the Structured Streaming API discussed in Chapter 3. I start spark-shell in local mode and run the following commands to simulate a stream of CSV files.
import org.apache.spark.sql.functions.window

spark.conf.set("spark.sql.shuffle.partitions", "5")

val staticDataFrame = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/retail-data/by-day/*.csv")

// Reuse the schema inferred from the static read for the streaming read
val staticSchema = staticDataFrame.schema

val streamingDataFrame = spark.readStream
  .schema(staticSchema)
  .option("maxFilesPerTrigger", 1) // read one file per trigger to simulate a stream
  .format("csv")
  .option("header", "true")
  .load("/data/retail-data/by-day/*.csv")

val purchaseByCustomerPerHour = streamingDataFrame
  .selectExpr(
    "CustomerId",
    "(UnitPrice * Quantity) as total_cost",
    "InvoiceDate")
  .groupBy(
    $"CustomerId", window($"InvoiceDate", "1 day"))
  .sum("total_cost")
Then I start the streaming job by running:

purchaseByCustomerPerHour.writeStream
  .format("memory") // memory = store in-memory table
  .queryName("customer_purchases") // the name of the in-memory table
  .outputMode("complete") // complete = all the counts should be in the table
  .start()
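As an aside, if the StreamingQuery handle returned by start() is kept (or retrieved via spark.streams.active), the running stream can be inspected or stopped from the same shell; a minimal sketch:

val query = spark.streams.active.head // or keep the value returned by start()
query.status       // current status of the streaming query
query.lastProgress // metrics for the most recent micro-batch
// query.stop()    // stops the stream when finished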
At this point the console starts logging, every 2-3 seconds, the task numbers and the number of files produced by the stream, and this appears to continue indefinitely.

I now want to run the following query against the running stream:
spark.sql("""
SELECT *
FROM customer_purchases
ORDER BY `sum(total_cost)` DESC
""")
.show(5)
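Within the shell session that started the stream, this query can simply be re-run (or looped) to watch the aggregates change as new files are picked up, for example:

// Poll the in-memory table a few times to see it update between micro-batches
for (_ <- 1 to 5) {
  spark.sql("SELECT * FROM customer_purchases ORDER BY `sum(total_cost)` DESC").show(5)
  Thread.sleep(3000) // pause between reads
}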
Unfortunately, I cannot figure out how to connect a second spark-shell instance to the existing Spark context that holds the streaming job and the "customer_purchases" table/view. Is this possible?