Running notebooks concurrently in Azure Databricks

Date: 2021-07-23 09:18:43

Tags: pyspark databricks azure-databricks

In Azure Databricks, I have a pyspark notebook that has to be executed roughly 10 to 20 times with different parameters. Since the individual runs do not depend on each other and are not performance-critical, I would like to run them concurrently to speed things up.

I use the following Python code, based on futures, to run a list of notebook definitions in parallel from a master notebook:

import json
from concurrent.futures import ThreadPoolExecutor, Future
from typing import List

def executeNotebook(notebook: NotebookData):
  print(f"Executing notebook {notebook.path}")
  try:
    return dbutils.notebook.run(notebook.path, notebook.timeout, notebook.getParameters())
  except Exception as e:
    if notebook.retry < 1:
      # Out of retries: surface a structured failure for the caller
      failed = json.dumps({"status": "failed", "error": str(e), "notebook": notebook.path})
      raise Exception(failed)
    print(f"Retrying notebook {notebook.path}")
    notebook.retry -= 1
    return executeNotebook(notebook)  # retry with the decremented budget

def tryFuture(future: Future):
  # Unwrap a future, converting any exception into its error string
  try:
    return future.result()
  except Exception as e:
    return str(e)

def parallelNotebooks(notebooks: List[NotebookData], maxParallel: int):
  print(f"Executing {len(notebooks)} notebooks with a maxParallel of {maxParallel}")
  with ThreadPoolExecutor(max_workers=maxParallel) as executor:
    results = [executor.submit(executeNotebook, notebook) for notebook in notebooks if notebook.enabled]
    return [tryFuture(r) for r in results]
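
The NotebookData helper is not shown in the post; judging from how it is used above (path, timeout, retry, enabled, getParameters), a hypothetical minimal version consistent with that usage might look like this:

# Hypothetical sketch of the NotebookData container used above; the post
# does not show its definition, only the attributes it must expose.
class NotebookData:
  def __init__(self, path: str, timeout: int, parameters: dict = None, retry: int = 0, enabled: bool = True):
    self.path = path          # workspace path of the notebook to run
    self.timeout = timeout    # timeout in seconds for dbutils.notebook.run
    self.parameters = parameters or {}
    self.retry = retry        # remaining retry budget
    self.enabled = enabled    # notebooks with enabled=False are skipped

  def getParameters(self) -> dict:
    return self.parameters

A driver cell would then build the list and call, for example, parallelNotebooks(notebooks, maxParallel=4).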

This works most of the time, but occasionally one or more notebooks are missing from the list of executed notebooks and Databricks raises the following error:

Context not valid. If you are calling this outside the main thread, you must set the Notebook context via dbutils.notebook.setContext(ctx), where ctx is a value retrieved from the main thread (and the same cell) via dbutils.notebook.getContext()

Does anyone know how I can prevent this? Where do I get this context from, and where do I have to set it?
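
Reading the message literally, my guess is that the context has to be captured once on the main thread and then installed in every worker thread before it calls dbutils.notebook.run, roughly along these lines (an untested sketch; it assumes setContext accepts exactly the value returned by getContext, as the message states):

# Untested sketch based on the wording of the error message: capture the
# context once on the main thread, then install it in each worker thread.
ctx = dbutils.notebook.getContext()   # must run on the main thread, in the same cell

def executeNotebookWithContext(notebook: NotebookData):
  dbutils.notebook.setContext(ctx)    # attach the main thread's context to this worker
  return executeNotebook(notebook)

# parallelNotebooks would then submit executeNotebookWithContext instead:
#   results = [executor.submit(executeNotebookWithContext, n) for n in notebooks if n.enabled]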

Thanks in advance!

0 Answers:

No answers yet.