I'm trying to install dask on Google Composer (Airflow). I used PyPI (via the GCP UI) to add dask and the required packages below (I'm not sure all of the Google packages are needed, but I couldn't find a requirements.txt):
dask
toolz
partd
cloudpickle
google-cloud
google-cloud-storage
google-auth
google-auth-oauthlib
decorator
When I run a DAG that calls dd.read_csv("a gcp bucket"), the Airflow log shows the following error:
[2018-10-24 22:25:12,729] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/dask/bytes/core.py", line 350, in get_fs_token_paths
[2018-10-24 22:25:12,733] {base_task_runner.py:98} INFO - Subtask: fs, fs_token = get_fs(protocol, options)
[2018-10-24 22:25:12,735] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/dask/bytes/core.py", line 473, in get_fs
[2018-10-24 22:25:12,740] {base_task_runner.py:98} INFO - Subtask: "Need to install `gcsfs` library for Google Cloud Storage support\n"
[2018-10-24 22:25:12,741] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/dask/utils.py", line 94, in import_required
[2018-10-24 22:25:12,748] {base_task_runner.py:98} INFO - Subtask: raise RuntimeError(error_msg)
[2018-10-24 22:25:12,751] {base_task_runner.py:98} INFO - Subtask: RuntimeError: Need to install `gcsfs` library for Google Cloud Storage support
[2018-10-24 22:25:12,756] {base_task_runner.py:98} INFO - Subtask: conda install gcsfs -c conda-forge
[2018-10-24 22:25:12,758] {base_task_runner.py:98} INFO - Subtask: or
[2018-10-24 22:25:12,762] {base_task_runner.py:98} INFO - Subtask: pip install gcsfs
So I tried installing gcsfs through PyPI as well, but that failed with a further Airflow error.
It seems to be stuck in a loop of required packages! I'm not sure what I'm missing here. Any ideas?
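Answer 0 (score: 0)
You don't need to add google-cloud-storage to the PyPI packages; it is already installed. I ran a DAG (image-version: composer-1.3.0-airflow-1.10.0) that logged the version of the preinstalled package, and it appears to be 1.13.0. I also added the following to my DAG to replicate your case: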
import logging

import dask.dataframe as dd

def read_csv_dask():
    # Reading from a gs:// path requires gcsfs to be importable by dask
    df = dd.read_csv('gs://gcs_path/data.csv')
    logging.info("csv from gs://gcs_path/ read alright")
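The version check mentioned above can be done from inside a task. The following is only a minimal sketch, not the code actually used in that DAG: the helper name `log_package_versions` is hypothetical, and it assumes Python 3.8+ for `importlib.metadata`, whereas the environment in the question runs Python 2.7.

```python
import logging
from importlib import metadata  # assumption: Python 3.8+ (the question's image runs 2.7)


def log_package_versions(names):
    """Log and return the installed version of each distribution (None if absent).

    `log_package_versions` is a hypothetical helper name used for illustration.
    """
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment
            versions[name] = None
        logging.info("%s: %s", name, versions[name] or "not installed")
    return versions
```

Called from a PythonOperator callable with, for example, `["google-cloud-storage", "dask", "gcsfs"]`, this would show in the task log which of the packages the worker can actually see.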
Before starting, I added the following dependencies via the UI:
dask==0.20.0
toolz==0.9.0
partd==0.3.9
cloudpickle==0.6.1
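The corresponding task failed with the same message as yours ("Need to install `gcsfs` library for Google Cloud Storage support"), at which point I went back to the UI and tried to add gcsfs==0.1.2. This never succeeded. However, I did not get the error you did; instead, the update repeatedly failed with "Composer Backend timed out".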
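At this point, you could consider the following alternatives:
1) Install gcsfs with pip in a BashOperator. This is not optimal, because gcsfs would be installed every time the DAG runs.
2) Use another library. What are you doing with this csv? If you upload it to the gs://composer_gcs_bucket/data/ directory (check here), you can read it with, for example, the csv standard library like this: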
import csv

def read_csv():
    # 'rU' is universal-newline mode on Python 2.7, which this Composer image runs
    f = open('/home/airflow/gcs/data/data.csv', 'rU')
    reader = csv.reader(f)
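For reference, here is a slightly fuller sketch of alternative 2. It is an illustration under stated assumptions rather than the exact code from the answer: the function name `read_csv_rows` is hypothetical, it takes the path as a parameter, and it is written for Python 3 (on the Python 2.7 image from the question, the `'rU'` mode above would be used instead of `newline=''`).

```python
import csv


def read_csv_rows(path):
    """Return every row of the CSV file at `path` as a list of strings.

    `read_csv_rows` is a hypothetical helper name used for illustration.
    """
    # newline='' is what the csv module documentation recommends on Python 3
    with open(path, newline='') as f:
        return list(csv.reader(f))
```

Inside the DAG this could be called as `read_csv_rows('/home/airflow/gcs/data/data.csv')`, using the locally mounted path from the answer. The `with` block closes the file once the rows have been read.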