Question

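I am running a PySpark application (Spark v2.4.0) on Kubernetes. My Spark application depends on the numpy and tensorflow modules. Please suggest a way to add these dependencies to the Spark executors.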

I have checked the documentation. We can include remote dependencies using --py-files, --jars, etc., but it says nothing about library dependencies.

Answer 1

I found a way to add library dependencies to a Spark application on K8S and thought I would share it here.

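Add the required dependency installation commands to the Dockerfile and rebuild the Spark image. When we submit the Spark job, the new containers are instantiated with the dependencies already installed.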

Dockerfile (/{spark_folder_path}/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile) contents:

RUN apk add --no-cache python && \
    apk add --no-cache python3 && \
    python -m ensurepip && \
    python3 -m ensurepip && \
    # We remove ensurepip since it adds no functionality since pip is
    # installed on the image and it just takes up 1.6MB on the image
    rm -r /usr/lib/python*/ensurepip && \
    pip install --upgrade pip setuptools && \
    # You may install with python3 packages by using pip3.6
    pip install numpy && \
    # Removed the .cache to save space
    rm -r /root/.cache
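
For the rebuild and submit steps, here is a minimal sketch, assuming the stock bin/docker-image-tool.sh and bin/spark-submit scripts shipped with the Spark 2.4.0 distribution. The registry, tag, API server address, and application path are hypothetical placeholders, and the run helper is only an illustration that prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical values: replace with your own registry, tag, API server, and script.
REPO="registry.example.com/myteam"
TAG="2.4.0-numpy"
K8S_MASTER="k8s://https://kubernetes.example.com:6443"
APP="local:///opt/app/my_job.py"

# docker-image-tool.sh names the Python image <repo>/spark-py:<tag>.
IMAGE="${REPO}/spark-py:${TAG}"

# Illustrative helper: print each command by default; set DRY_RUN=0 to execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run from the root of the Spark distribution, after editing the Dockerfile.
run ./bin/docker-image-tool.sh -r "$REPO" -t "$TAG" build
run ./bin/docker-image-tool.sh -r "$REPO" -t "$TAG" push

# Point the driver and executors at the rebuilt image.
run ./bin/spark-submit \
    --master "$K8S_MASTER" \
    --deploy-mode cluster \
    --name numpy-job \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.container.image="$IMAGE" \
    "$APP"
```

Because every executor pod starts from the same image, anything installed by the Dockerfile's pip step should be importable in the job without passing it through --py-files.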

Numpy and other library dependencies for Spark applications on Kubernetes
