Question

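I am running a simple Spark application test that reads one year of data and writes the same amount of data to Hive, partitioned by day. Before writing, I coalesce the data into 15 partitions so that I don't write many small files. I want to do this in parallel using the FAIR scheduler; my application runs 200 executors with 4 cores each (which means 800 tasks can run at once). This is the pool configuration: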

<allocations>
  <pool name="writing_pool">
    <schedulingMode>FAIR</schedulingMode>
    <minShare>400</minShare>
  </pool>
</allocations>
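For context, a pool file like this one is only used when the application runs in FAIR scheduling mode and Spark can find the file. The following is a minimal sketch, not the asker's actual setup: the question does not show how `spark` is created, and the path `/path/to/fairscheduler.xml` and the application name are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: the pool XML is saved at this hypothetical path on the driver.
val allocationFile = "/path/to/fairscheduler.xml"

val spark = SparkSession.builder()
  .appName("fair-scheduler-write-test") // hypothetical name
  // Switch scheduling between jobs from the default FIFO to FAIR.
  .config("spark.scheduler.mode", "FAIR")
  // Tell Spark where the pool definitions (writing_pool) live.
  .config("spark.scheduler.allocation.file", allocationFile)
  .enableHiveSupport()
  .getOrCreate()
```

The same two properties can also be passed with `--conf` on `spark-submit` instead of being set in code.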

The idea is that each job writes 10 days of data:

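import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Each group of 10 dates is submitted as its own job from a separate driver thread.
dates.grouped(10).toSeq.par.foreach { s =>
  // Local properties are per thread, so the pool is set inside each parallel task.
  spark.sparkContext.setLocalProperty("spark.scheduler.pool", "writing_pool")
  println("submitting writes for " + s.mkString(","))

  val toBeInserted = dataWithDate
    .where(col("yyyy_mm_dd").isin(s: _*))

  // Coalesce to 15 partitions to avoid writing many small files.
  toBeInserted.coalesce(15).write.mode(SaveMode.Overwrite)
    .insertInto("test_write_buffer_parallel_hive")
}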

I don't know why, but my cluster is not fully utilized, for example:

As you can see, only 38 tasks are running at this moment, while I have 200 x 4 = 800 slots available. Do you know why?
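To put the numbers from the question side by side, here is a rough back-of-the-envelope sketch. It assumes a 365-day year and counts only the final write stage of each job, which has at most 15 tasks because of `coalesce(15)`. The number of jobs that actually run concurrently is left as a parameter, since it depends on how many threads the parallel collection uses on the driver. `SlotEstimate` and its members are hypothetical names used only for illustration, and the sketch is not offered as a confirmed explanation of the observed 38 tasks.

```scala
object SlotEstimate {
  // Figures stated in the question.
  val executors = 200
  val coresPerExecutor = 4
  val totalSlots: Int = executors * coresPerExecutor

  // Assumption: one year is 365 daily dates, grouped exactly as in the question.
  val jobs: Int = (1 to 365).grouped(10).size

  // coalesce(15) caps the write stage of each job at 15 tasks.
  val tasksPerWriteStage = 15

  // Upper bound on write tasks running at once for a given number of concurrent jobs.
  def maxConcurrentWriteTasks(concurrentJobs: Int): Int =
    math.min(concurrentJobs, jobs) * tasksPerWriteStage

  def main(args: Array[String]): Unit = {
    println(s"total slots: $totalSlots")
    println(s"jobs submitted: $jobs")
    println(s"write tasks if all jobs ran at once: ${maxConcurrentWriteTasks(jobs)}")
  }
}
```

Under these assumptions, the write stages alone cannot occupy all 800 slots, even if every job ran at the same time.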

How do I set up the Spark fair scheduler and pools correctly?

0 answers: