Installing Python packages on a Databricks cluster using an init script

Date: 2020-06-22 13:55:45

Tags: python linux bash cluster-computing azure-databricks

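I installed the Databricks CLI by running the following command:

pip install databricks-cli

Use the version of pip that matches your Python installation. If you are using Python 3, run pip3.

Then, after creating a PAT (a personal access token in Databricks), I ran the following .sh bash script: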

#!/bin/bash
# You can run this on Windows as well; just convert it to a batch file.
# Note: the Databricks CLI must be installed and a token must be configured.
echo "Creating DBFS directory"
dbfs mkdirs dbfs:/databricks/packages

echo "Uploading cluster init script"
dbfs cp --overwrite python_dependencies.sh dbfs:/databricks/packages/python_dependencies.sh

echo "Listing DBFS directory"
dbfs ls dbfs:/databricks/packages

The python_dependencies.sh script:

#!/bin/bash
# Restart cluster after running.

sudo apt-get install applicationinsights=0.11.9 -V -y
sudo apt-get install azure-servicebus=0.50.2 -V -y
sudo apt-get install azure-storage-file-datalake=12.0.0 -V -y
sudo apt-get install humanfriendly=8.2 -V -y
sudo apt-get install mlflow=1.8.0 -V -y
sudo apt-get install numpy=1.18.3 -V -y
sudo apt-get install opencensus-ext-azure=1.0.2 -V -y
sudo apt-get install packaging=20.4 -V -y
sudo apt-get install pandas=1.0.3 -V -y
sudo apt update
sudo apt-get install scikit-learn=0.22.2.post1 -V -y
status=$?
echo "The date command exit status : ${status}"

I use the script above as the cluster's init script to install the Python libraries.


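My problem is that, even though everything seems fine and the cluster starts successfully, the libraries are not installed correctly. When I click the cluster's "Libraries" tab, I get the following: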

Only one of the 10 Python libraries was installed.

Thanks for your help and comments.
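
A possible reason the cluster still starts cleanly: in python_dependencies.sh, status=$? captures only the exit status of the last install command, and the script itself ends with an echo, which returns 0. Earlier failures are therefore never reported. As far as I know, Databricks fails the cluster launch only when the init script exits with a non-zero code. The sketch below reproduces the effect with false and true standing in for a failing and a succeeding install; the failures counter is a name made up for the illustration.

#!/bin/bash
# Stand-ins for install commands: the first fails, the second succeeds.
false
true
status=$?
# Only the last command's status is visible here, so this prints 0.
echo "status seen by the script: ${status}"

# Checking each command individually exposes the failure.
failures=0
for cmd in false true; do
  if ! "${cmd}"; then
    failures=$((failures + 1))
  fi
done
echo "commands that failed: ${failures}"

Checking every command, or stopping at the first error, makes a broken install visible at cluster start instead of leaving it to be discovered later in the "Libraries" tab.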

1 Answer:

Answer 0 (score: 1)

Based on @RedCricket's comment, I found a solution:

#!/bin/bash

pip install applicationinsights==0.11.9
pip install azure-servicebus==0.50.2
pip install azure-storage-file-datalake==12.0.0
pip install humanfriendly==8.2
pip install mlflow==1.8.0
pip install numpy==1.18.3
pip install opencensus-ext-azure==1.0.2
pip install packaging==20.4
pip install pandas==1.0.3
pip install --upgrade scikit-learn==0.22.2.post1

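The .sh file above installs all the referenced Python dependencies when the cluster starts, so the libraries do not have to be reinstalled each time the notebook is re-run.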
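
For context, apt-get installs Debian packages, while the libraries listed here are distributed on PyPI, which is what pip installs from. pip also pins versions with == rather than =. A variation on the script above, offered as a minimal sketch and not as the only way to do it, keeps the pins in a requirements file and stops at the first error. PIP_BIN, DRY_RUN and REQ_FILE are hypothetical names introduced for this example, not Databricks settings. The sketch also assumes that the pip found on PATH belongs to the Python environment the cluster uses. DRY_RUN defaults to 1 so the script can be tried without installing anything.

#!/bin/bash
# Sketch only: PIP_BIN, DRY_RUN and REQ_FILE are illustrative names.
set -eu   # stop at the first failing command or unset variable

PIP_BIN="${PIP_BIN:-pip}"          # assumption: this pip targets the cluster's Python
DRY_RUN="${DRY_RUN:-1}"            # 1 = print the command only, 0 = really install
REQ_FILE="${REQ_FILE:-$(mktemp)}"  # where the pinned requirements are written

# Same packages and versions as in the script above.
cat > "${REQ_FILE}" <<'EOF'
applicationinsights==0.11.9
azure-servicebus==0.50.2
azure-storage-file-datalake==12.0.0
humanfriendly==8.2
mlflow==1.8.0
numpy==1.18.3
opencensus-ext-azure==1.0.2
packaging==20.4
pandas==1.0.3
scikit-learn==0.22.2.post1
EOF

if [ "${DRY_RUN}" = "1" ]; then
  echo "would run: ${PIP_BIN} install -r ${REQ_FILE}"
else
  # A single pip call lets pip see all pins together; with set -e, a failure
  # here gives the script a non-zero exit status.
  "${PIP_BIN}" install -r "${REQ_FILE}"
fi

To use it as a real init script, set DRY_RUN=0 (or drop the dry-run branch) and upload it with dbfs cp as shown in the question.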