In Azure Databricks I have a PySpark notebook that has to be executed about 10 to 20 times with different parameters. Since the individual runs don't depend on each other and aren't performance-heavy, I would like to run them concurrently to speed things up.
I use the following Python code, based on futures, to run a list of notebook definitions in parallel from a main notebook:
import json
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List

def executeNotebook(notebook: NotebookData):
    print(f"Executing notebook {notebook.path}")
    try:
        return dbutils.notebook.run(notebook.path, notebook.timeout, notebook.getParameters())
    except Exception as e:
        if notebook.retry < 1:
            failed = json.dumps({"status": "failed", "error": str(e), "notebook": notebook.path})
            raise Exception(failed)
        print(f"Retrying notebook {notebook.path}")
        notebook.retry -= 1
        return executeNotebook(notebook)  # actually retry instead of falling through with None

def tryFuture(future: Future):
    try:
        return future.result()
    except Exception as e:
        return str(e)

def parallelNotebooks(notebooks: List[NotebookData], maxParallel: int):
    print(f"Executing {len(notebooks)} notebooks with a maxParallel of {maxParallel}")
    with ThreadPoolExecutor(max_workers=maxParallel) as executor:
        results = [executor.submit(executeNotebook, notebook) for notebook in notebooks if notebook.enabled]
    return [tryFuture(r) for r in results]
This works most of the time, but occasionally one or more notebooks are missing from the list of executed notebooks and Databricks raises the following error:
Context not valid. If you are calling this outside the main thread, you must set the Notebook context via dbutils.notebook.setContext(ctx), where ctx is a value retrieved from the main thread (and the same cell) via dbutils.notebook.getContext()
Does anyone know how I can prevent this? Where do I get this context from, and where do I have to set it?
Thanks in advance!
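If I read the error message correctly, the idea would be to capture the context once on the main thread and re-apply it inside each worker before calling `run`, roughly like the sketch below. Since `dbutils` only exists inside the Databricks runtime, I've stubbed it here just to illustrate the threading pattern; the stub names and behavior are my own assumption, not the real implementation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stub standing in for Databricks' dbutils.notebook, only to
# illustrate the pattern; the real object is provided by the runtime.
class _NotebookUtilsStub:
    def __init__(self):
        self._local = threading.local()

    def getContext(self):
        return "ctx-from-main-thread"

    def setContext(self, ctx):
        self._local.ctx = ctx

    def run(self, path, timeout, arguments=None):
        # Mimics the failure: run() needs a context on the current thread.
        if getattr(self._local, "ctx", None) is None:
            raise RuntimeError("Context not valid.")
        return f"ran {path}"

class _DbutilsStub:
    notebook = _NotebookUtilsStub()

dbutils = _DbutilsStub()

# Capture the context once, on the main thread (and in the same cell).
ctx = dbutils.notebook.getContext()

def executeNotebook(path):
    # Re-apply the main thread's context inside the worker before run().
    dbutils.notebook.setContext(ctx)
    return dbutils.notebook.run(path, 600)

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(executeNotebook, f"/Notebooks/job_{i}") for i in range(10)]
    results = [f.result() for f in futures]

print(results)
```

But I'm not sure whether this is the intended usage, or where exactly the `getContext`/`setContext` calls have to go in my real code.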