How to install dask on Google Composer

Time: 2018-11-01 15:43:52

Tags: airflow dask google-cloud-composer

I am trying to install dask on Google Composer (Airflow). Using PyPI (through the GCP UI), I added dask and the following required packages. I am not sure all of the google packages are necessary, but I could not find a requirements file:

    dask
    toolz
    partd
    cloudpickle
    google-cloud
    google-cloud-storage
    google-auth
    google-auth-oauthlib
    decorator

When I run a DAG that calls dd.read_csv("a gcp bucket"), the Airflow log shows the following error:

    [2018-10-24 22:25:12,729] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/dask/bytes/core.py", line 350, in get_fs_token_paths
    [2018-10-24 22:25:12,733] {base_task_runner.py:98} INFO - Subtask:     fs, fs_token = get_fs(protocol, options)
    [2018-10-24 22:25:12,735] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/dask/bytes/core.py", line 473, in get_fs
    [2018-10-24 22:25:12,740] {base_task_runner.py:98} INFO - Subtask:     "Need to install `gcsfs` library for Google Cloud Storage support\n"
    [2018-10-24 22:25:12,741] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/dask/utils.py", line 94, in import_required
    [2018-10-24 22:25:12,748] {base_task_runner.py:98} INFO - Subtask:     raise RuntimeError(error_msg)
    [2018-10-24 22:25:12,751] {base_task_runner.py:98} INFO - Subtask: RuntimeError: Need to install `gcsfs` library for Google Cloud Storage support
    [2018-10-24 22:25:12,756] {base_task_runner.py:98} INFO - Subtask:     conda install gcsfs -c conda-forge
    [2018-10-24 22:25:12,758] {base_task_runner.py:98} INFO - Subtask:     or
    [2018-10-24 22:25:12,762] {base_task_runner.py:98} INFO - Subtask:     pip install gcsfs

So I tried installing gcsfs through PyPI as well, but that also ended in an Airflow error.

It seems to be stuck in a loop of required packages! Am I missing something here? Any ideas?

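1 answer:

Answer 0: (score: 0)

You do not need to add google-cloud-storage to the PyPI packages; it is already installed. I ran a DAG (image-version: composer-1.3.0-airflow-1.10.0) that logged the version of the pre-installed package, and it appears to be 1.13.0. I also added the following to the DAG to reproduce your case: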

import logging

import dask.dataframe as dd

def read_csv_dask():
    # Reading a gs:// path makes dask look for the gcsfs library.
    df = dd.read_csv('gs://gcs_path/data.csv')
    logging.info("csv from gs://gcs_path/ read alright")
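
The version check mentioned above could look roughly like the sketch below. It is not the DAG actually used for this answer: it assumes Python 3.8+ for `importlib.metadata` (the log in the question shows Python 2.7, where `pkg_resources.get_distribution(name).version` plays the same role), and `log_package_version` is a hypothetical helper name.

```python
import logging
from importlib import metadata


def log_package_version(package_name):
    """Log and return the installed version of a package, or None if absent.

    Hypothetical helper for illustration; meant to be called from a task.
    """
    try:
        version = metadata.version(package_name)
    except metadata.PackageNotFoundError:
        logging.info("%s is not installed", package_name)
        return None
    logging.info("%s version: %s", package_name, version)
    return version
```

Calling `log_package_version("google-cloud-storage")` from a Python task would then write the pre-installed version to the Airflow task log.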

Before starting, I added the following dependencies via the UI:

dask==0.20.0
toolz==0.9.0
partd==0.3.9
cloudpickle==0.6.1

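The corresponding task failed with the same message as yours ("Need to install `gcsfs` library for Google Cloud Storage support"). At that point I went back to the UI and tried to add gcsfs==0.1.2. This never succeeded. However, I did not get the error you got; instead, the update failed repeatedly with "Composer Backend timed out".

At this point, you could consider the following alternatives:

1) Install gcsfs with pip in a BashOperator. This is not optimal, because gcsfs gets installed every time the DAG runs.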
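
Option 1 might be sketched as below. This is only an illustration under several assumptions: the names `GCSFS_SPEC`, `INSTALL_GCSFS_CMD` and `make_install_task` are hypothetical, the import path is the Airflow 1.10 one, the worker is assumed to allow `pip install` from inside a task, and the install task is assumed to run on the same worker as the task that reads the CSV. The shell command skips pip when gcsfs can already be imported, which softens the reinstall-on-every-run cost.

```python
# Hypothetical names throughout; none of them come from Airflow or Composer.
GCSFS_SPEC = "gcsfs==0.1.2"  # the version tried earlier in this answer

# Run pip only when the import fails, so repeat runs on the same worker are cheap.
INSTALL_GCSFS_CMD = (
    'python -c "import gcsfs" 2>/dev/null || pip install {}'.format(GCSFS_SPEC)
)


def make_install_task(dag):
    """Return a BashOperator that installs gcsfs if it is missing."""
    # Airflow 1.10 import path; imported lazily so this module loads without Airflow.
    from airflow.operators.bash_operator import BashOperator

    return BashOperator(
        task_id="install_gcsfs",
        bash_command=INSTALL_GCSFS_CMD,
        dag=dag,
    )
```

In a DAG file, the returned task would be placed upstream of the task that calls `dd.read_csv`, for example `make_install_task(dag) >> read_task`, where `read_task` is whatever operator wraps `read_csv_dask`.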

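2) Use another library. What are you doing with this csv? If you upload it to the gs://composer_gcs_bucket/data/ directory (see here), you can read it with, for example, the csv standard library, like this: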

import csv
def read_csv():
    f=open('/home/airflow/gcs/data/data.csv', 'rU')
    reader = csv.reader(f)