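Scenario: We create a virtual environment and install everything from the require.txt file, but a few files end up outside the environment directory.
Use case: We want to zip that environment and use it for the Spark driver and executors.
Problem: Because a few files are installed outside the virtual environment directory, Spark fails with a "module not found" exception or reports that a lib*.so file is not available.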
Answer 0 (score: 0)
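To solve this issue I applied the following steps.
I also wrote this up as a blog post: https://kshitij-kuls.com/2019/08/04/setting-up-virtual-environment-for-pyspark/
Before proceeding, it helps to understand the basic layout of a Python virtual environment: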
├── bin
│ ├── activate
│ ├── activate.csh
│ ├── activate.fish
│ ├── activate_this.py
│ ├── easy_install
│ ├── easy_install-3.6
│ ├── pip
│ ├── pip3
│ ├── pip3.6
│ ├── python
│ ├── python-config
│ ├── python3 -> python
│ ├── python3.6 -> python
│ └── wheel
├── include
│ └── python3.6m -> /usr/include/python3.6m
├── lib
│ └── python3.6
│     ├── site-packages
│     └── lib-dynload -> /usr/lib/python3.6/lib-dynload [Dynamic Library]
Environment variables:
PYSPARK_PYTHON : Points to the Python executable: bin/python
LD_LIBRARY_PATH : Points to the dynamic library path: lib/python3.6/lib-dynload [All .so* files]
PYTHONPATH : Points to the packages installed in the virtual environment and to the dynamic library path: lib/python3.6/site-packages<CPS>lib/python3.6/lib-dynload [All .py files]
PYTHONHOME : Points to the Python library path: lib/python3.6/site-packages
Steps to build the virtual environment:
Install the desired version of Python on the machine.
Create Virtual Env
virtualenv env -p /usr/local/bin/python3
Activate Virtual Env
source env/bin/activate
Install requirements
pip install numpy
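Now here is the trick. Notice the line
├── lib-dynload -> /usr/lib/python3.6/lib-dynload
This is a symlink pointing to a path on the local machine, so if you zip only this virtual environment folder, these dependencies will be missing on the cluster.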
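To see which links are affected, the environment can be scanned for symlinks that resolve outside its own root. This is a minimal sketch, not part of the original answer: the function name `find_external_symlinks` is hypothetical, and it assumes the environment lives in a local directory such as env/.

```python
import os
from pathlib import Path


def find_external_symlinks(venv_root):
    """Return sorted (link, target) pairs for symlinks that resolve outside venv_root."""
    root = Path(venv_root).resolve()
    external = []
    # followlinks=False: symlinked directories are listed but not descended into.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            if not path.is_symlink():
                continue
            target = path.resolve()
            # A target is "external" when it does not live under the venv root.
            if target != root and root not in target.parents:
                external.append((path, target))
    return sorted(external)
```

Calling `find_external_symlinks("env")` on the layout shown earlier would be expected to report the lib-dynload link, along with any other link that leaves the environment, such as the include/python3.6m entry.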
So you need to copy all .so* files from /usr/lib/python3.6/lib-dynload, /usr/lib64/*.so.*, etc. into lib/python3.6/lib-dynload,
and copy all .py files from /usr/lib/python3.6/lib-dynload, /usr/lib64/*.so.*, etc. into lib/python3.6/site-packages.
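One detail worth keeping in mind: lib-dynload inside the environment is itself a symlink, so copying files "into" it would write to the system directory it points at. A possible way around this is to replace the link with a real directory first. The sketch below is an illustration only; the helper name `materialize_symlinked_dir` is hypothetical, and it covers just the directory link, not the extra /usr/lib64 libraries mentioned above.

```python
import shutil
from pathlib import Path


def materialize_symlinked_dir(link_path):
    """Replace a directory symlink with a real directory holding copies of the target's files.

    Returns True if the link was replaced, False if link_path was not a symlink.
    """
    link = Path(link_path)
    if not link.is_symlink():
        return False  # already a real directory (or missing); nothing to do
    target = link.resolve()
    if not target.is_dir():
        raise NotADirectoryError(f"{link} does not point to a directory: {target}")
    link.unlink()  # removes only the link, never the target directory
    shutil.copytree(target, link)  # copy the target's contents into the venv
    return True
```

For example, `materialize_symlinked_dir("env/lib/python3.6/lib-dynload")` run next to the env/ folder would leave a self-contained lib-dynload directory that survives zipping.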
Run the next step from the root directory of the virtual environment, env/ in our case.
Prepare zip
zip -rq ../venv.zip *
Upload the zip to the /udf folder for tdss: /tookitaki/tdss/udf/
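Running zip from inside env/ matters because the archive entries then start at bin/ and lib/ rather than at env/. As a rough Python equivalent of the zip command, assuming the same layout, the sketch below writes entries relative to the environment root. The name `zip_venv` is hypothetical, and the archive should be written outside the environment folder (as with ../venv.zip) so it does not include itself.

```python
import os
import zipfile
from pathlib import Path


def zip_venv(venv_root, archive_path):
    """Zip the contents of venv_root with entry names relative to it (bin/python, lib/...)."""
    root = Path(venv_root)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # followlinks=True stores the content behind symlinked directories
        # instead of skipping them; avoid this if the tree has cyclic links.
        for dirpath, _dirnames, filenames in os.walk(root, followlinks=True):
            for name in filenames:
                full = Path(dirpath) / name
                archive.write(full, full.relative_to(root).as_posix())
    return archive_path
```

Listing the result with `zipfile.ZipFile("venv.zip").namelist()` is a quick way to confirm that bin/python sits at the top level of the archive.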
Environment variable settings
For the driver: spark.yarn.appMasterEnv.[Environment variable]
For the executors: spark.executorEnv.[Environment variable]
PYSPARK_PYTHON
pyspark.spark.yarn.appMasterEnv.PYSPARK_PYTHON = venv/bin/python
pyspark.spark.executorEnv.PYSPARK_PYTHON = venv/bin/python
PYTHONHOME
pyspark.spark.yarn.appMasterEnv.PYTHONHOME = venv/lib64/python3.6/site-packages
pyspark.spark.executorEnv.PYTHONHOME = venv/lib64/python3.6/site-packages
LD_LIBRARY_PATH
pyspark.spark.yarn.appMasterEnv.LD_LIBRARY_PATH = venv/lib64/python3.6/lib-dynload
pyspark.spark.executorEnv.LD_LIBRARY_PATH = venv/lib64/python3.6/lib-dynload
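Since every variable is set twice, once for the application master and once for the executors, the pairs can be generated rather than typed by hand. The sketch below is an assumption-laden illustration: the function name `venv_spark_conf` and its parameters are hypothetical, and the leading `pyspark.` seen in the listing is assumed to be specific to the author's platform, so the keys here use the plain Spark property names.

```python
def venv_spark_conf(venv_alias="venv", py_dir="lib64/python3.6"):
    """Build Spark properties that set the same env vars for the YARN app master and executors."""
    env_vars = {
        "PYSPARK_PYTHON": f"{venv_alias}/bin/python",
        "PYTHONHOME": f"{venv_alias}/{py_dir}/site-packages",
        "LD_LIBRARY_PATH": f"{venv_alias}/{py_dir}/lib-dynload",
    }
    conf = {}
    for name, value in env_vars.items():
        # Driver side (YARN cluster mode) and executor side get identical values.
        conf[f"spark.yarn.appMasterEnv.{name}"] = value
        conf[f"spark.executorEnv.{name}"] = value
    return conf
```

The resulting dictionary could then be handed to Spark, for instance with `SparkConf().setAll(list(venv_spark_conf().items()))`, or written out as `--conf key=value` arguments for spark-submit.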
PYTHONPATH
This one has to be included in YARN-ENV-ENTRIES; it is not set from the Spark configuration.
PYTHONPATH = {{PWD}}/__venv__.zip<CPS>{{PWD}}/__py4j-0.10.7-src__.zip<CPS>venv/lib64/python3.6/site-packages<CPS>venv/lib64/python3.6/lib-dynload<CPS>
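In the entry above, `<CPS>` and `{{PWD}}` are placeholders that YARN expands when it launches a container: `<CPS>` becomes the platform's path separator and `{{PWD}}` becomes the container's working directory. To inspect what the final search path would look like, the expansion can be imitated locally. This is only an approximation for debugging; the function name `expand_yarn_placeholders` and the example directory are hypothetical.

```python
def expand_yarn_placeholders(value, pwd, path_sep=":"):
    """Imitate YARN's expansion of <CPS> and {{PWD}} and return the path entries as a list."""
    expanded = value.replace("<CPS>", path_sep).replace("{{PWD}}", pwd)
    # Drop empty entries such as the one produced by a trailing separator.
    return [entry for entry in expanded.split(path_sep) if entry]


entry = (
    "{{PWD}}/__venv__.zip<CPS>{{PWD}}/__py4j-0.10.7-src__.zip<CPS>"
    "venv/lib64/python3.6/site-packages<CPS>venv/lib64/python3.6/lib-dynload<CPS>"
)
for path in expand_yarn_placeholders(entry, pwd="/container/workdir"):
    print(path)
```

The relative entries (venv/...) are resolved against the container's working directory at run time, which is why the archive has to be unpacked there under the name venv.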
To run python
cd venv
export PYTHONPATH=lib64/python3.6/site-packages:lib64/python3.6/lib-dynload/
export LD_LIBRARY_PATH=lib64/python3.6/lib-dynload
source bin/activate
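Once the environment is active, a quick check can confirm that the expected packages are visible before the archive is shipped to the cluster. The sketch below is a hypothetical helper (`missing_modules` is not from the original answer). It only locates modules on the search path, so it will not catch a shared library that is found but fails to load.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of module names that cannot be located on the current search path."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None  # a parent package is missing or the name is malformed
        if spec is None:
            missing.append(name)
    return missing
```

For the environment built above, `missing_modules(["numpy"])` returning an empty list would suggest that numpy is visible to the interpreter started from venv/bin/python.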