Question

I am trying to run one Jupyter notebook from within another in Databricks.

The following code fails with the error "df3 is not defined", even though df3 is defined.

input_file = pd.read_csv("/dbfs/mnt/container_name/input_files/xxxxxx.csv")
df3 = input_file
%run ./NotebookB

The first line of code in NotebookB is as follows (all of the Markdown renders in Databricks without problems):

df3.iloc[:,1:] = df3.iloc[:,1:].clip(lower=0)

I do not get this error when I run the notebooks in Jupyter; for example, the following code works:

input_file = pd.read_csv("xxxxxx.csv")
df3 = input_file
%run "NotebookB.ipynb"

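Basically, it seems that when NotebookB runs in Databricks, the definition of df3 is not used or is forgotten, which causes the "not defined" error.

Both Jupyter notebooks are located in the same Workspace folder in Databricks.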

Answer 1

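I see that you want to pass structured data, such as a DataFrame, from one Azure Databricks notebook to another by calling it.

Please refer to the official documentation, Notebook Workflows, to see how to do this with the functions dbutils.notebook.run and dbutils.notebook.exit.

Here is the Python sample code from the "Pass structured data" section of that documentation.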

%python

# Example 1 - returning data through temporary tables.
# You can only return one string using dbutils.notebook.exit(), but since called notebooks reside in the same JVM, you can
# return a name referencing data stored in a temporary table.

## In callee notebook
sqlContext.range(5).toDF("value").createOrReplaceGlobalTempView("my_data")
dbutils.notebook.exit("my_data")

## In caller notebook
returned_table = dbutils.notebook.run("LOCATION_OF_CALLEE_NOTEBOOK", 60)
global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")
display(table(global_temp_db + "." + returned_table))

So, to pass a pandas DataFrame in your code, you first need to convert it to a PySpark DataFrame with the spark.createDataFrame function, as shown below.

df3 = spark.createDataFrame(input_file)

Then pass it on with the code below.

df3.createOrReplaceGlobalTempView("df3")
dbutils.notebook.exit("df3")

At the same time, swap the roles of NotebookA and NotebookB, so that NotebookA is the caller and calls NotebookB as the callee.
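Putting the pieces together, the caller side could look roughly like the sketch below. It assumes the callee has registered a global temp view and exited with the view's name, as in the code above. load_returned_view is a hypothetical helper name, and spark and dbutils are passed in as arguments so the function does not rely on notebook globals. spark.table and toPandas are used instead of display(table(...)) so that the result comes back as a pandas DataFrame.

```python
def load_returned_view(spark, dbutils, notebook_path, timeout_seconds=60):
    """Run the callee notebook and load the view it names as a pandas DataFrame.

    Hypothetical helper: assumes the callee ends with
    dbutils.notebook.exit("<name of a global temp view>").
    """
    # The callee returns a single string: the name of the global temp view.
    view_name = dbutils.notebook.run(notebook_path, timeout_seconds)

    # Global temp views live in a dedicated database (usually "global_temp").
    global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")

    # Read the view as a Spark DataFrame, then convert it back to pandas.
    return spark.table(global_temp_db + "." + view_name).toPandas()


# Usage inside a Databricks notebook (not executed here):
# df3 = load_returned_view(spark, dbutils, "LOCATION_OF_CALLEE_NOTEBOOK")
```

Note that toPandas collects all rows to the driver, so this only suits data that fits in driver memory.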

Answer 2

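In notebook A, save the df to a CSV file, then call notebook B and pass the CSV path as a parameter. Notebook B reads from that path, performs some operations, and overwrites the CSV. Notebook A then reads from the same path and now has the desired result.

An example:

Notebook A (caller)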

# write df to /path/test-csv.csv
df = spark.range(5)
df.write.csv(path = '/path/test-csv.csv')
df.show()

# call notebook B with the csv path /path/test-csv.csv
nb = "/path/notebook-b"
dbutils.notebook.run(nb, 60, {'df_path': 'dbfs:/path/test-csv.csv'})

# once the transformation has completed (add error handling here), read again from the same path
spark.read.format("csv").load('dbfs:/path/test-csv.csv').show()

Output:

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+

+---+---+
|_c0|_c1|
+---+---+
|  0|0.0|
|  1|2.0|
|  2|4.0|
|  3|6.0|
|  4|8.0|
+---+---+

Notebook B (callee)

# create a widget that receives the path parameter
dbutils.widgets.text("df_path", '/', 'df-test')
df_path = dbutils.widgets.get("df_path")

# read from path
df = spark.read.format("csv").load(df_path)

# execute whatever operation
df = df.withColumn('2x', df['_c0'] * 2)

# overwrite the same path with the transformed dataset
df.write.csv(path = df_path, mode = "overwrite")

dbutils.notebook.exit(0)
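Since the question works with a pandas DataFrame, the same file round trip can also be sketched with pandas alone. The version below is only an illustration that runs locally: notebook_a_step and notebook_b_step are hypothetical stand-ins for the two notebooks, the temporary path replaces a /dbfs/... location, and the direct function call marks the spot where dbutils.notebook.run(nb, 60, {'df_path': ...}) would go on Databricks. The clip mirrors the first line of NotebookB in the question.

```python
import os
import tempfile

import pandas as pd


def notebook_b_step(df_path: str) -> None:
    """Stand-in for notebook B: read the CSV, clip negatives, overwrite it."""
    df3 = pd.read_csv(df_path)
    # Clip every column except the first to a minimum of 0.
    cols = df3.columns[1:]
    df3[cols] = df3[cols].clip(lower=0)
    df3.to_csv(df_path, index=False)


def notebook_a_step(input_file: pd.DataFrame, df_path: str) -> pd.DataFrame:
    """Stand-in for notebook A: write the data, invoke B, read the result."""
    input_file.to_csv(df_path, index=False)
    # On Databricks, this is where dbutils.notebook.run(...) would be called.
    notebook_b_step(df_path)
    return pd.read_csv(df_path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df3.csv")  # hypothetical file name
    sample = pd.DataFrame(
        {"id": [1, 2, 3], "a": [-1.0, 0.5, 2.0], "b": [3.0, -4.0, 0.0]}
    )
    result = notebook_a_step(sample, path)
    print(result)
```

Writing with index=False keeps the pandas index out of the file, so the column layout is the same before and after the round trip.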

Error when importing one Databricks notebook into another
