Installing pandas in Docker Alpine

Time: 2019-02-26 16:45:51

Tags: python pandas numpy docker alpine

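I am having a really hard time trying to install a stable data science package configuration in Docker. This should be easier with such mainstream, relevant tools.

The following is the Dockerfile that used to work, with a small change: pandas was removed from the core package list and is installed separately, pinned to pandas<0.21.0, because higher versions allegedly conflict with numpy.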

FROM alpine:3.6

ENV PACKAGES="\
    dumb-init \
    musl \
    libc6-compat \
    linux-headers \
    build-base \
    bash \
    git \
    ca-certificates \
    freetype \
    libgfortran \
    libgcc \
    libstdc++ \
    openblas \
    tcl \
    tk \
    libssl1.0 \
    "

ENV PYTHON_PACKAGES="\
    numpy \
    matplotlib \
    scipy \
    scikit-learn \
    nltk \
    " 

RUN apk add --no-cache --virtual build-dependencies python3 \
    && apk add --virtual build-runtime \
    build-base python3-dev openblas-dev freetype-dev pkgconfig gfortran \
    && ln -s /usr/include/locale.h /usr/include/xlocale.h \
    && python3 -m ensurepip \
    && rm -r /usr/lib/python*/ensurepip \
    && pip3 install --upgrade pip setuptools \
    && ln -sf /usr/bin/python3 /usr/bin/python \
    && ln -sf pip3 /usr/bin/pip \
    && rm -r /root/.cache \
    && pip install --no-cache-dir $PYTHON_PACKAGES \
# pandas is installed separately and pinned below 0.21.0
    && pip3 install 'pandas<0.21.0' \
    && apk del build-runtime \
    && apk add --no-cache --virtual build-dependencies $PACKAGES \
    && rm -rf /var/cache/apk/*

# set working directory
WORKDIR /usr/src/app

# add and install requirements (packages other than the data science ones go here)
COPY ./requirements.txt /usr/src/app/requirements.txt
RUN pip install -r requirements.txt

# add entrypoint.sh
COPY ./entrypoint.sh /usr/src/app/entrypoint.sh

RUN chmod +x /usr/src/app/entrypoint.sh

# add app
COPY . /usr/src/app

# run server
CMD ["/usr/src/app/entrypoint.sh"]

The configuration above used to work fine. What happens now is that the build does go through, but pandas fails at import with the following error:

ImportError: Missing required dependencies ['numpy']

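Since numpy 1.16.1 is installed, I don't know which numpy pandas is trying to find...

Does anyone know how to get a stable solution for this?

Note: a solution that pulls from a turnkey Docker image for data science, providing at least the packages mentioned above, into the Dockerfile above would also be very welcome.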


  

EDIT 1

If I move the installation of the data packages into requirements.txt, as suggested in the comments, like so:

requirements.txt

(...)
numpy==1.16.1 # or numpy==1.16.0
scikit-learn==0.20.2
scipy==1.2.1
nltk==3.4   
pandas==0.24.1 # or pandas== 0.23.4
matplotlib==3.0.2 
(...)

Dockerfile

# add and install requirements
COPY ./requirements.txt /usr/src/app/requirements.txt
RUN pip install -r requirements.txt

it breaks at pandas again, complaining about numpy:

Collecting numpy==1.16.1 (from -r requirements.txt (line 61))
  Downloading https://files.pythonhosted.org/packages/2b/26/07472b0de91851b6656cbc86e2f0d5d3a3128e7580f23295ef58b6862d6c/numpy-1.16.1.zip (5.1MB)
Collecting scikit-learn==0.20.2 (from -r requirements.txt (line 62))
  Downloading https://files.pythonhosted.org/packages/49/0e/8312ac2d7f38537361b943c8cde4b16dadcc9389760bb855323b67bac091/scikit-learn-0.20.2.tar.gz (10.3MB)
Collecting scipy==1.2.1 (from -r requirements.txt (line 63))
  Downloading https://files.pythonhosted.org/packages/a9/b4/5598a706697d1e2929eaf7fe68898ef4bea76e4950b9efbe1ef396b8813a/scipy-1.2.1.tar.gz (23.1MB)
Collecting nltk==3.4 (from -r requirements.txt (line 64))
  Downloading https://files.pythonhosted.org/packages/6f/ed/9c755d357d33bc1931e157f537721efb5b88d2c583fe593cc09603076cc3/nltk-3.4.zip (1.4MB)
Collecting pandas==0.24.1 (from -r requirements.txt (line 65))
  Downloading https://files.pythonhosted.org/packages/81/fd/b1f17f7dc914047cd1df9d6813b944ee446973baafe8106e4458bfb68884/pandas-0.24.1.tar.gz (11.8MB)
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/pkg_resources/__init__.py", line 359, in get_provider
        module = sys.modules[moduleOrReq]
    KeyError: 'numpy'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-_e5z6o6_/pandas/setup.py", line 732, in <module>
        ext_modules=maybe_cythonize(extensions, compiler_directives=directives),
      File "/tmp/pip-install-_e5z6o6_/pandas/setup.py", line 475, in maybe_cythonize
        numpy_incl = pkg_resources.resource_filename('numpy', 'core/include')
      File "/usr/local/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1144, in resource_filename
        return get_provider(package_or_requirement).get_resource_filename(
      File "/usr/local/lib/python3.7/site-packages/pkg_resources/__init__.py", line 361, in get_provider
        __import__(moduleOrReq)
    ModuleNotFoundError: No module named 'numpy'

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-_e5z6o6_/pandas/
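
The log above shows pip running pandas' `setup.py egg_info` while numpy has only been downloaded, not installed, so the import of numpy inside `setup.py` fails. One possible workaround is to install numpy in a separate pip invocation before everything else. The following is a minimal sketch under that assumption, not a confirmed fix: `split_requirements`, `requirement_name`, `install` and the `BUILD_FIRST` set are hypothetical helpers introduced here for illustration, and the sketch only addresses the egg_info failure, not a numpy build that is itself broken.

```python
import re
import subprocess
import sys

# Assumption: these are the packages that other setup.py scripts import at
# build time, so they must already be installed before the rest is collected.
BUILD_FIRST = {"numpy"}


def requirement_name(line):
    """Return the lowercase project name of a requirement line, or None."""
    stripped = line.split("#", 1)[0].strip()  # drop trailing comments
    if not stripped or stripped.startswith("-"):  # blank lines and pip options
        return None
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", stripped)
    return match.group(0).lower() if match else None


def split_requirements(lines):
    """Split requirement lines into (install first, install afterwards)."""
    first, rest = [], []
    for line in lines:
        name = requirement_name(line)
        if name is None:
            continue  # placeholders such as "(...)" are skipped too
        spec = line.split("#", 1)[0].strip()
        (first if name in BUILD_FIRST else rest).append(spec)
    return first, rest


def install(specs):
    """Run one pip invocation for the given requirement specifiers."""
    if specs:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "--no-cache-dir", *specs]
        )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python install_in_order.py requirements.txt
    with open(sys.argv[1]) as handle:
        first, rest = split_requirements(handle)
    install(first)  # numpy is built and installed on its own
    install(rest)   # pandas' setup.py can now import numpy
```

In a Dockerfile this would replace the single `pip install -r requirements.txt` step; two plain `RUN pip install ...` lines, numpy first, would have the same effect.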

  

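EDIT 2

This seems to be an open pandas issue. For more details, see:

pandas-dev github

"Unfortunately, this means that a requirements.txt file is insufficient for setting up a new environment with pandas installed (like in a docker container)."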

  **ImportError**:

  IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

  Importing the multiarray numpy extension module failed.  Most
  likely you are trying to import a failed build of numpy.
  Here is how to proceed:
  - If you're working with a numpy git repository, try `git clean -xdf`
    (removes all files not under version control) and rebuild numpy.
  - If you are simply trying to use the numpy version that you have installed:
    your installation is broken - please reinstall numpy.
  - If you have already reinstalled and that did not fix the problem, then:
    1. Check that you are using the Python you expect (you're using /usr/local/bin/python),
       and that you have no directories in your PATH or PYTHONPATH that can
       interfere with the Python and numpy versions you're trying to use.
    2. If (1) looks fine, you can open a new issue at
       https://github.com/numpy/numpy/issues.  Please include details on:
       - how you installed Python
       - how you installed numpy
       - your operating system
       - whether or not you have multiple versions of Python installed
       - if you built from source, your compiler versions and ideally a build log
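
The message above asks which Python is actually in use. It may be worth noting that the Dockerfile symlinks /usr/bin/python3, while the tracebacks refer to /usr/local, which could mean two interpreters are present in the image. A small diagnostic sketch follows; `probe` is a hypothetical helper, not part of numpy or pandas. It imports each module directly, so the underlying error is printed instead of only pandas' summary message.

```python
import importlib
import sys


def probe(module_name):
    """Try to import a module and return (ok, detail) without raising."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # a broken C extension may raise more than ImportError
        return False, f"{type(exc).__name__}: {exc}"
    version = getattr(module, "__version__", "unknown")
    location = getattr(module, "__file__", "built-in")
    return True, f"{version} at {location}"


if __name__ == "__main__":
    # Shows which interpreter runs the script and where each package comes from.
    print("interpreter:", sys.executable)
    for name in ("numpy", "pandas"):
        ok, detail = probe(name)
        print(f"{name}: {'OK' if ok else 'FAILED'} -> {detail}")
```

Running it with both `python` and `python3` inside the container would show whether the two commands resolve to the same interpreter and the same numpy.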
  

EDIT 3

requirements.txt ---> https://pastebin.com/0icnx0iu

4 answers:

Answer 0 (score: 10)

If you're not bound to Alpine 3.6, you can use Alpine 3.7 (or later).

On Alpine 3.6, installing matplotlib failed for me with the following:

Collecting matplotlib
  Downloading https://files.pythonhosted.org/packages/26/04/8b381d5b166508cc258632b225adbafec49bbe69aa9a4fa1f1b461428313/matplotlib-3.0.3.tar.gz (36.6MB)
    Complete output from command python setup.py egg_info:
    Download error on https://pypi.org/simple/numpy/: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833) -- Some packages may not be found!
    Couldn't find index page for 'numpy' (maybe misspelled?)
    Download error on https://pypi.org/simple/: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833) -- Some packages may not be found!
    No local packages or working download links found for numpy>=1.10.0

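On Alpine 3.7, however, it worked. This may be due to a numpy versioning issue (see here), but I could not tell for sure. With that problem out of the way, the packages were built and installed successfully. It took a good while, about 30 minutes, because Alpine's musl-libc is not compatible with Python's Wheels format, so all packages installed with pip have to be built from source.

Note that one important change is required: you should remove the build-runtime virtual package (apk del build-runtime) only after pip install. Also, if applicable, you could replace numpy 1.16.1 with 1.16.2, which is the shipped version (otherwise 1.16.2 will be uninstalled and 1.16.1 built from source, further increasing build time). I haven't tried that, though.

For reference, here is my slightly modified Dockerfile and the docker build output.

Note:

Alpine is usually chosen as a base to minimize image size (Alpine is also very slick, but it has compatibility issues with mainstream Linux applications because of glibc/musl). Having to build Python packages from source rather defeats that purpose, since you end up with a very bloated image, 900MB before any cleanup, which also takes ages to build. The image could be compacted considerably by removing all intermediate compilation artifacts, build dependencies and so on, but still.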
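
To illustrate the wheel incompatibility mentioned above: prebuilt Linux wheels on PyPI are tagged `manylinux`, which is defined against glibc, so pip on a musl-based system skips them and falls back to the source archive. The sketch below only parses wheel file names as laid out in PEP 427. `wheel_tags` and `needs_glibc` are hypothetical helpers, and the file names in the usage note are illustrative rather than taken from the build log.

```python
def wheel_tags(filename):
    """Return (python tag, abi tag, platform tag) from a wheel file name.

    PEP 427 layout: name-version[-build]-python-abi-platform.whl
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel file name: {filename}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return python_tag, abi_tag, platform_tag


def needs_glibc(filename):
    """True if the wheel carries a manylinux platform tag, which implies glibc."""
    platform_tag = wheel_tags(filename)[2]
    # A platform tag may be a dot-separated set of several tags.
    return any(tag.startswith("manylinux") for tag in platform_tag.split("."))
```

For example, `needs_glibc("numpy-1.16.1-cp37-cp37m-manylinux1_x86_64.whl")` is true, while a pure-Python wheel such as the made-up `example_pkg-1.0-py3-none-any.whl` is not tied to glibc and installs on Alpine without compiling.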

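If you can't get the Python package versions you need on Alpine without building them from source, I would suggest trying other small and more compatible base images, such as debian-slim or maybe even ubuntu.

EDIT:

Following the additional requirements added in "EDIT 3", here are the updated Dockerfile and Docker build output. The following packages were added to satisfy build dependencies:

postgresql-dev libffi-dev libressl-dev libxml2 libxml2-dev libxslt libxslt-dev libjpeg-turbo-dev zlib-dev

For packages that failed to build because of a specific missing header, I used Alpine's package contents search to locate the missing package. Specifically for cffi, the ffi.h header was missing, which requires the libffi-dev package: https://pkgs.alpinelinux.org/contents?file=ffi.h&path=&name=&branch=v3.7

Alternatively, when the cause of a package build failure is not obvious, you can consult the installation instructions of the specific package, for example Pillow.

The new image size, before any compaction, is 1.04GB. To cut it down, you can remove the Python and pip caches:

RUN apk del build-runtime && \
    find -type d -name __pycache__ -prune -exec rm -rf {} \; && \
    rm -rf ~/.cache/pip

With this cleanup applied and the image compacted, the image size can be reduced to 661MB.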
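
Before deleting the Python caches, it can be useful to see how much space they actually take. The following sketch sums the size of all files under `__pycache__` directories. `pycache_bytes` is a hypothetical helper and the root path in the usage note is an assumption about where packages live in the image.

```python
import os


def pycache_bytes(root):
    """Return the total size in bytes of files inside __pycache__ directories."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath) != "__pycache__":
            continue
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
        dirnames[:] = []  # do not descend further, similar to find's -prune
    return total
```

Calling `pycache_bytes("/usr/lib")` inside the container, for instance, gives a rough idea of what the `find ... -exec rm -rf` step would reclaim there.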

Answer 1 (score: 3)

Try adding this to your requirements.txt file:

numpy==1.16.0
pandas==0.23.4

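I have been facing the same error since yesterday, and this change solved it for me.

Answer 2 (score: 0)

This may not be completely relevant, but since this is the first answer that pops up when searching for numpy/pandas installation failures on Alpine, I am adding this answer.

The following fix worked for me (although installing pandas/numpy then takes longer):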

apk update
apk --no-cache add curl gcc g++
ln -s /usr/include/locale.h /usr/include/xlocale.h

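Now try installing pandas/numpy.

Answer 3 (score: 0)

There is an older Q&A on this: Why does it take ages to install Pandas on Alpine Linux.

If you want a stable solution without understanding the nuts and bolts, then for Python 3 you can build the following (copied and pasted from my answer at https://stackoverflow.com/a/50443531/1021819):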

FROM python:3.7-alpine
RUN echo "@testing http://dl-cdn.alpinelinux.org/alpine/edge/testing" >> /etc/apk/repositories
RUN apk add --update --no-cache py3-numpy py3-pandas@testing

If your goal is to understand how to achieve a stable build, the discussion there and the related images may also help...