How can I connect multiple spark-shells to the same SparkContext to test Structured Streaming?

Asked: 2018-11-29 11:17:49

Tags: apache-spark spark-structured-streaming

I am working through Spark: The Definitive Guide and trying out the Structured Streaming API discussed in Chapter 3. I start spark-shell in local mode and run the following commands to simulate a stream of CSV files.

import org.apache.spark.sql.functions.window

spark.conf.set("spark.sql.shuffle.partitions", "5")

val staticDataFrame = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/retail-data/by-day/*.csv")

val streamingDataFrame = spark.readStream
    .schema(staticSchema)
    .option("maxFilesPerTrigger", 1)
    .format("csv")
    .option("header", "true")
    .load("/data/retail-data/by-day/*.csv")

val purchaseByCustomerPerHour = streamingDataFrame
  .selectExpr(
    "CustomerId",
    "(UnitPrice * Quantity) as total_cost",
    "InvoiceDate")
  .groupBy(
    $"CustomerId", window($"InvoiceDate", "1 day"))
  .sum("total_cost")

I then start the streaming job by running:

purchaseByCustomerPerHour.writeStream
    .format("memory") // memory = store in-memory table
    .queryName("customer_purchases") // the name of the in-memory table
    .outputMode("complete") // complete = all the counts should be in the table
    .start()

At this point the console starts logging, every 2-3 seconds, the task numbers and the number of files produced by the stream. This appears to continue indefinitely.
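
Side note (not part of the book's example): the micro-batch loop keeps running until the query is stopped. If you later need a handle on it in the same shell, you can look it up by the query name; spark.streams.active, status and stop() are standard StreamingQuery APIs:

// Look up the running query by its registered name, then inspect or stop it
val query = spark.streams.active.find(_.name == "customer_purchases").get
query.status   // reports whether a trigger is currently active
// query.stop() // terminates the micro-batch loop when you are done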

I now want to run the following query against the running stream.

spark.sql("""
  SELECT *
  FROM customer_purchases
  ORDER BY `sum(total_cost)` DESC
  """)
  .show(5)
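
For what it's worth, the same lookup can also be expressed with the DataFrame API against the registered in-memory table (a minimal equivalent sketch, assuming the same session):

import org.apache.spark.sql.functions.desc

// Equivalent to the SQL above: read the in-memory sink table, sort by the aggregate column
spark.table("customer_purchases")
  .orderBy(desc("sum(total_cost)"))
  .show(5)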

Unfortunately, I cannot figure out how to connect a second spark-shell instance to the existing Spark context that holds the streaming process and the "customer_purchases" table/view. Is this possible?

0 Answers