Issue running two streaming queries with equal weights in a single Databricks job

Posted: 2019-09-17 00:37:31

Tags: apache-spark spark-streaming databricks

I have the following sample code, where the input is a streaming data source from Azure Event Hubs, and I am running this as a job in Databricks:

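// sample streaming source: reads CSV files as a stream (stands in for the Event Hubs input)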
val input = spark.readStream
      .schema(schema)
      .option("sep", ",")
      .option("header", "true")
      .format("csv")
      .load("/some/sample/path")

// added some new features/columns
val res1 = input.withColumn("col1", lit("")).withColumn("col2", lit(""))

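// route the jobs of the next query to the "cook" fair-scheduler pool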
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "cook")
res1.writeStream
      .option("checkpointLocation", "/some/path")
      .queryName("cook")
      .format("avro")
      .option("path",  "/some/cooked/data/path")
      .start()

// added some more features/columns
val res2 = res1.withColumn("col3", lit("")).withColumn("col4", lit(""))

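// route the jobs of the next query to the "applicationMain" pool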
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "applicationMain")
res2.writeStream
      .option("checkpointLocation", "/some/path2")
      .queryName("applicationMain")
      .format("avro")
      .option("path",  "/some/other/data/path2")
      .start()
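
(Not shown above: the driver has to stay alive after both start() calls, otherwise the job ends and the queries stop with it. A minimal sketch of that last step, using the standard StreamingQueryManager API:)

// Block the main thread until any started query stops or fails.
spark.streams.awaitAnyTermination()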

My fairscheduler.xml is:

<?xml version="1.0"?>
<allocations>
  <pool name="cook">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="applicationMain">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>

And I also pass the following settings to spark-submit:

"--conf","spark.scheduler.mode=FAIR",
"--conf","spark.scheduler.allocation.file=/dbfs/fairscheduler.xml"

In the Spark UI, everything appears to be working fine:

[Spark UI screenshot]

The problem: the cook query appears to write its data correctly, but the applicationMain query does not seem to execute at all. There are no log/error/warning messages, and although the UI suggests it ran, its output directory stays empty.
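
One way to check whether the second query is actually running, a minimal sketch using the standard StreamingQuery API (the printlns are only for illustration):

// List the queries the driver still considers active; a query that failed drops
// out of this list, and its error is available via the .exception of the handle
// returned by its start().
spark.streams.active.foreach { q =>
  println(s"name=${q.name}, isActive=${q.isActive}")
  println(q.status)        // trigger/data-availability status
  println(q.lastProgress)  // null until the first micro-batch completes
}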

Has anyone run into a similar problem? Any idea what I might be missing?

Thanks!

0 Answers