In Azure Databricks I have a PySpark notebook that has to be executed about 10 to 20 times with different parameters. Since the individual runs don't depend on each other and aren't performance-heavy, I would like to run them concurrently to speed things up.
I use the following Python code, based on futures, to run a list of notebook definitions in parallel from a main notebook:
import json
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List

def executeNotebook(notebook: NotebookData):
    print(f"Executing notebook {notebook.path}")
    try:
        return dbutils.notebook.run(notebook.path, notebook.timeout, notebook.getParameters())
    except Exception as e:
        if notebook.retry < 1:
            failed = json.dumps({"status": "failed", "error": str(e), "notebook": notebook.path})
            raise Exception(failed)
        print(f"Retrying notebook {notebook.path}")
        notebook.retry -= 1
        return executeNotebook(notebook)  # actually retry instead of falling through with None

def tryFuture(future: Future):
    try:
        return future.result()
    except Exception as e:
        return str(e)

def parallelNotebooks(notebooks: List[NotebookData], maxParallel: int):
    print(f"Executing {len(notebooks)} notebooks with a maxParallel of {maxParallel}")
    with ThreadPoolExecutor(max_workers=maxParallel) as executor:
        results = [executor.submit(executeNotebook, notebook) for notebook in notebooks if notebook.enabled]
    return [tryFuture(r) for r in results]
This works most of the time, but occasionally one or more notebooks are missing from the list of executed notebooks and Databricks raises the following error:
Context not valid. If you are calling this outside the main thread, you must set the Notebook context via dbutils.notebook.setContext(ctx), where ctx is a value retrieved from the main thread (and the same cell) via dbutils.notebook.getContext()
Does anyone know how I can prevent this? Where do I get this context from, and where do I have to set it?
Thanks in advance!
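If I read the error message correctly, the idea would be to capture the context once on the main thread and re-apply it inside each worker before calling `run`, roughly like the sketch below. Since `dbutils` only exists inside the Databricks runtime, I've stubbed it here just to illustrate the threading pattern; the stub names and behavior are my own assumption, not the real implementation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stub standing in for Databricks' dbutils.notebook, only to
# illustrate the pattern; the real object is provided by the runtime.
class _NotebookUtilsStub:
    def __init__(self):
        self._local = threading.local()

    def getContext(self):
        return "ctx-from-main-thread"

    def setContext(self, ctx):
        self._local.ctx = ctx

    def run(self, path, timeout, arguments=None):
        # Mimics the failure: run() needs a context on the current thread.
        if getattr(self._local, "ctx", None) is None:
            raise RuntimeError("Context not valid.")
        return f"ran {path}"

class _DbutilsStub:
    notebook = _NotebookUtilsStub()

dbutils = _DbutilsStub()

# Capture the context once, on the main thread (and in the same cell).
ctx = dbutils.notebook.getContext()

def executeNotebook(path):
    # Re-apply the main thread's context inside the worker before run().
    dbutils.notebook.setContext(ctx)
    return dbutils.notebook.run(path, 600)

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(executeNotebook, f"/Notebooks/job_{i}") for i in range(10)]
    results = [f.result() for f in futures]

print(results)
```

But I'm not sure whether this is the intended usage, or where exactly the `getContext`/`setContext` calls have to go in my real code.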