I have an Apache Beam pipeline that is triggered by Airflow. In my development environment, Airflow manages to build the package (using pip, I assume) and then uploads it to Dataflow, where the job runs afterwards. In my production environment, however, it is a different story.
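For context, the task that launches the job is a plain BashOperator; a simplified sketch of it is below (my reconstruction from the log further down, with the long argument list of the real command shortened):

from airflow.operators.bash_operator import BashOperator

model_predict = BashOperator(
    task_id='model_predict',
    bash_command=(
        'python /usr/lib/hello-processing/data_learning_tools/inference/'
        'model_predict/sklearn_api/predictor.py '
        '--runner=DataflowRunner --project=hello-analytics ...'  # full argument list is visible in the log below
    ),
    dag=dag,  # the learning.pack_operation DAG, defined elsewhere
)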
Here is the output of the Airflow operator responsible for launching the Dataflow job:
*** Log file isn't local.
*** Fetching here: http://work4.podb.hello.io:8793/log/learning.pack_operation/model_predict/2018-07-17T00:00:00/2.log
[2018-07-23 21:00:07,943] {cli.py:374} INFO - Running on host work4.podb.hello.io
[2018-07-23 21:00:07,987] {models.py:1197} INFO - Dependencies all met for <TaskInstance: learning.pack_operation.model_predict 2018-07-17 00:00:00 [queued]>
[2018-07-23 21:00:08,004] {models.py:1197} INFO - Dependencies all met for <TaskInstance: learning.pack_operation.model_predict 2018-07-17 00:00:00 [queued]>
[2018-07-23 21:00:08,005] {models.py:1407} INFO -
--------------------------------------------------------------------------------
Starting attempt 2 of 2
--------------------------------------------------------------------------------
[2018-07-23 21:00:08,026] {models.py:1428} INFO - Executing <Task(BashOperator): model_predict> on 2018-07-17 00:00:00
[2018-07-23 21:00:08,026] {base_task_runner.py:115} INFO - Running: ['bash', '-c', u'airflow run learning.pack_operation model_predict 2018-07-17T00:00:00 --job_id 725805 --raw -sd DAGS_FOLDER/learning/pack_operation/dag.py']
[2018-07-23 21:00:09,185] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:09,184] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/Grammar.txt
[2018-07-23 21:00:09,205] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:09,205] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
[2018-07-23 21:00:09,944] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:09,943] {configuration.py:206} WARNING - section/key [celery/celery_ssl_active] not found in config
[2018-07-23 21:00:09,945] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:09,944] {default_celery.py:41} WARNING - Celery Executor will run without SSL
[2018-07-23 21:00:09,951] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:09,946] {__init__.py:45} INFO - Using executor CeleryExecutor
[2018-07-23 21:00:10,088] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:10,087] {models.py:189} INFO - Filling up the DagBag from /usr/lib/hello-processing/processing/dags/learning/pack_operation/dag.py
[2018-07-23 21:00:12,242] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:12,242] {bash_operator.py:70} INFO - Tmp dir root location:
[2018-07-23 21:00:12,243] {base_task_runner.py:98} INFO - Subtask: /tmp
[2018-07-23 21:00:12,245] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:12,244] {bash_operator.py:80} INFO - Temporary script location: /tmp/airflowtmpiThdaC//tmp/airflowtmpiThdaC/model_predictiogrlD
[2018-07-23 21:00:12,246] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:12,245] {bash_operator.py:88} INFO - Running command: python /usr/lib/hello-processing/data_learning_tools/inference/model_predict/sklearn_api/predictor.py --input-filebase=gs://1e42-analytics_data/learning/pack_operation/20180717_1_0_0/extracted-*.json --output-table=hello-analytics:learning.pack_operation_inferred_1_0_020180717 --model-path=gs://1e42-analytics_data/learning/airflow_settings/model.pkl --description-path=gs://1e42-analytics_data/learning/airflow_settings/description.json --ids=user_id,bought_pack,targeted --batch-size=1000 --runner=DataflowRunner --job-name=model-pack-operation-20180723-210012 --region=europe-west1 --project=hello-analytics --max-num-workers=10 --temp-location=gs://1e42-analytics_data/learning/airflow_settings/tmp --staging-location=gs://1e42-analytics_data/learning/airflow_settings/staging --library=xgboost
[2018-07-23 21:00:12,246] {base_task_runner.py:98} INFO - Subtask:
[2018-07-23 21:00:12,254] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:12,253] {bash_operator.py:97} INFO - Output:
[2018-07-23 21:00:15,144] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,143] {bash_operator.py:101} INFO - /usr/lib/hello-processing/oauth2client/contrib/gce.py:99: UserWarning: You have requested explicit scopes to be used with a GCE service account.
[2018-07-23 21:00:15,144] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,143] {bash_operator.py:101} INFO - Using this argument will have no effect on the actual scopes for tokens
[2018-07-23 21:00:15,145] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,143] {bash_operator.py:101} INFO - requested. These scopes are set at VM instance creation time and
[2018-07-23 21:00:15,145] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,143] {bash_operator.py:101} INFO - can't be overridden in the request.
[2018-07-23 21:00:15,145] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,143] {bash_operator.py:101} INFO -
[2018-07-23 21:00:15,145] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,143] {bash_operator.py:101} INFO - warnings.warn(_SCOPES_WARNING)
[2018-07-23 21:00:15,147] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,146] {bash_operator.py:101} INFO - /usr/lib/hello-processing/apache_beam/io/gcp/gcsio.py:160: DeprecationWarning: object() takes no parameters
[2018-07-23 21:00:15,147] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,147] {bash_operator.py:101} INFO - super(GcsIO, cls).__new__(cls, storage_client))
[2018-07-23 21:00:15,416] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,415] {bash_operator.py:101} INFO - /usr/lib/hello-processing/apache_beam/coders/typecoders.py:133: UserWarning: Using fallback coder for typehint: Any.
[2018-07-23 21:00:15,417] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,416] {bash_operator.py:101} INFO - warnings.warn('Using fallback coder for typehint: %r.' % typehint)
[2018-07-23 21:00:15,448] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,447] {bash_operator.py:101} INFO - /usr/lib/hello-processing/apache_beam/coders/typecoders.py:133: UserWarning: Using fallback coder for typehint: Dict[Any, Any].
[2018-07-23 21:00:15,449] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,448] {bash_operator.py:101} INFO - warnings.warn('Using fallback coder for typehint: %r.' % typehint)
[2018-07-23 21:00:16,293] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,292] {bash_operator.py:101} INFO - running sdist
[2018-07-23 21:00:16,293] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,293] {bash_operator.py:101} INFO - running egg_info
[2018-07-23 21:00:16,303] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,303] {bash_operator.py:101} INFO - creating model_predict_sklearn.egg-info
[2018-07-23 21:00:16,305] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,305] {bash_operator.py:101} INFO - writing requirements to model_predict_sklearn.egg-info/requires.txt
[2018-07-23 21:00:16,306] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,305] {bash_operator.py:101} INFO - writing model_predict_sklearn.egg-info/PKG-INFO
[2018-07-23 21:00:16,307] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,307] {bash_operator.py:101} INFO - writing top-level names to model_predict_sklearn.egg-info/top_level.txt
[2018-07-23 21:00:16,308] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,307] {bash_operator.py:101} INFO - writing dependency_links to model_predict_sklearn.egg-info/dependency_links.txt
[2018-07-23 21:00:16,315] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,315] {bash_operator.py:101} INFO - writing manifest file 'model_predict_sklearn.egg-info/SOURCES.txt'
[2018-07-23 21:00:16,323] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,322] {bash_operator.py:101} INFO - reading manifest file 'model_predict_sklearn.egg-info/SOURCES.txt'
[2018-07-23 21:00:16,324] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,324] {bash_operator.py:101} INFO - writing manifest file 'model_predict_sklearn.egg-info/SOURCES.txt'
[2018-07-23 21:00:16,325] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,324] {bash_operator.py:101} INFO - warning: sdist: standard file not found: should have one of README, README.rst, README.txt
[2018-07-23 21:00:16,325] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,324] {bash_operator.py:101} INFO -
[2018-07-23 21:00:16,325] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,324] {bash_operator.py:101} INFO - running check
[2018-07-23 21:00:16,419] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,417] {bash_operator.py:101} INFO - warning: check: missing required meta-data: url
[2018-07-23 21:00:16,420] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,417] {bash_operator.py:101} INFO -
[2018-07-23 21:00:16,420] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,417] {bash_operator.py:101} INFO - creating model-predict-sklearn-1.0.0
[2018-07-23 21:00:16,420] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,417] {bash_operator.py:101} INFO - creating model-predict-sklearn-1.0.0/lib
[2018-07-23 21:00:16,420] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,417] {bash_operator.py:101} INFO - creating model-predict-sklearn-1.0.0/model_predict_sklearn.egg-info
[2018-07-23 21:00:16,420] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,417] {bash_operator.py:101} INFO - copying files to model-predict-sklearn-1.0.0...
[2018-07-23 21:00:16,421] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,417] {bash_operator.py:101} INFO - copying setup.py -> model-predict-sklearn-1.0.0
[2018-07-23 21:00:16,421] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,418] {bash_operator.py:101} INFO - copying lib/__init__.py -> model-predict-sklearn-1.0.0/lib
[2018-07-23 21:00:16,421] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,418] {bash_operator.py:101} INFO - copying lib/gcp.py -> model-predict-sklearn-1.0.0/lib
[2018-07-23 21:00:16,421] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,418] {bash_operator.py:101} INFO - copying lib/models.py -> model-predict-sklearn-1.0.0/lib
[2018-07-23 21:00:16,422] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,418] {bash_operator.py:101} INFO - copying lib/transforms.py -> model-predict-sklearn-1.0.0/lib
[2018-07-23 21:00:16,422] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,418] {bash_operator.py:101} INFO - copying lib/utils.py -> model-predict-sklearn-1.0.0/lib
[2018-07-23 21:00:16,422] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,418] {bash_operator.py:101} INFO - copying model_predict_sklearn.egg-info/PKG-INFO -> model-predict-sklearn-1.0.0/model_predict_sklearn.egg-info
[2018-07-23 21:00:16,422] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,418] {bash_operator.py:101} INFO - copying model_predict_sklearn.egg-info/SOURCES.txt -> model-predict-sklearn-1.0.0/model_predict_sklearn.egg-info
[2018-07-23 21:00:16,423] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,418] {bash_operator.py:101} INFO - copying model_predict_sklearn.egg-info/dependency_links.txt -> model-predict-sklearn-1.0.0/model_predict_sklearn.egg-info
[2018-07-23 21:00:16,423] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,419] {bash_operator.py:101} INFO - copying model_predict_sklearn.egg-info/requires.txt -> model-predict-sklearn-1.0.0/model_predict_sklearn.egg-info
[2018-07-23 21:00:16,427] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,419] {bash_operator.py:101} INFO - copying model_predict_sklearn.egg-info/top_level.txt -> model-predict-sklearn-1.0.0/model_predict_sklearn.egg-info
[2018-07-23 21:00:16,429] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,419] {bash_operator.py:101} INFO - Writing model-predict-sklearn-1.0.0/setup.cfg
[2018-07-23 21:00:16,429] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,426] {bash_operator.py:101} INFO - Creating tar archive
[2018-07-23 21:00:16,430] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,427] {bash_operator.py:101} INFO - removing 'model-predict-sklearn-1.0.0' (and everything under it)
[2018-07-23 21:00:17,172] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,172] {bash_operator.py:101} INFO - /usr/bin/python: No module named pip
[2018-07-23 21:00:17,178] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,175] {bash_operator.py:101} INFO - Traceback (most recent call last):
[2018-07-23 21:00:17,179] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,175] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/data_learning_tools/inference/model_predict/sklearn_api/predictor.py", line 88, in <module>
[2018-07-23 21:00:17,179] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,175] {bash_operator.py:101} INFO - main()
[2018-07-23 21:00:17,179] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,175] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/data_learning_tools/inference/model_predict/sklearn_api/predictor.py", line 79, in main
[2018-07-23 21:00:17,179] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,175] {bash_operator.py:101} INFO - write_disposition=BigQueryDisposition.WRITE_APPEND if is_partitioned_output else BigQueryDisposition.WRITE_TRUNCATE
[2018-07-23 21:00:17,180] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,176] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/pipeline.py", line 349, in __exit__
[2018-07-23 21:00:17,180] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,176] {bash_operator.py:101} INFO - self.run().wait_until_finish()
[2018-07-23 21:00:17,180] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,176] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/pipeline.py", line 342, in run
[2018-07-23 21:00:17,180] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,176] {bash_operator.py:101} INFO - return self.runner.run_pipeline(self)
[2018-07-23 21:00:17,181] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,176] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/runners/dataflow/dataflow_runner.py", line 315, in run_pipeline
[2018-07-23 21:00:17,181] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,176] {bash_operator.py:101} INFO - self.dataflow_client.create_job(self.job), self)
[2018-07-23 21:00:17,181] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,176] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/utils/retry.py", line 175, in wrapper
[2018-07-23 21:00:17,181] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,176] {bash_operator.py:101} INFO - return fun(*args, **kwargs)
[2018-07-23 21:00:17,181] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,176] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/runners/dataflow/internal/apiclient.py", line 461, in create_job
[2018-07-23 21:00:17,182] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,177] {bash_operator.py:101} INFO - self.create_job_description(job)
[2018-07-23 21:00:17,182] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,177] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/runners/dataflow/internal/apiclient.py", line 491, in create_job_description
[2018-07-23 21:00:17,182] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,177] {bash_operator.py:101} INFO - job.options, file_copy=self._gcs_file_copy)
[2018-07-23 21:00:17,182] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,177] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/runners/dataflow/internal/dependency.py", line 400, in stage_job_resources
[2018-07-23 21:00:17,183] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,177] {bash_operator.py:101} INFO - _stage_beam_sdk_tarball(sdk_remote_location, staged_path, temp_dir)
[2018-07-23 21:00:17,183] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,177] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/runners/dataflow/internal/dependency.py", line 485, in _stage_beam_sdk_tarball
[2018-07-23 21:00:17,183] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,177] {bash_operator.py:101} INFO - _dependency_file_copy(_download_pypi_sdk_package(temp_dir), staged_path)
[2018-07-23 21:00:17,183] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,177] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/runners/dataflow/internal/dependency.py", line 584, in _download_pypi_sdk_package
[2018-07-23 21:00:17,184] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,177] {bash_operator.py:101} INFO - processes.check_call(cmd_args)
[2018-07-23 21:00:17,184] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,178] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/utils/processes.py", line 44, in check_call
[2018-07-23 21:00:17,185] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,178] {bash_operator.py:101} INFO - return subprocess.check_call(*args, **kwargs)
[2018-07-23 21:00:17,186] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,178] {bash_operator.py:101} INFO - File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
[2018-07-23 21:00:17,186] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,178] {bash_operator.py:101} INFO - raise CalledProcessError(retcode, cmd)
[2018-07-23 21:00:17,186] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,178] {bash_operator.py:101} INFO - subprocess.CalledProcessError: Command '['/usr/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmpB3lvC3', 'google-cloud-dataflow==2.3.0', '--no-binary', ':all:', '--no-deps']' returned non-zero exit status 1
[2018-07-23 21:00:17,308] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,306] {bash_operator.py:105} INFO - Command exited with return code 1
[2018-07-23 21:00:17,346] {base_task_runner.py:98} INFO - Subtask: Traceback (most recent call last):
[2018-07-23 21:00:17,347] {base_task_runner.py:98} INFO - Subtask: File "/usr/lib/hello-processing/airflow/bin/airflow", line 27, in <module>
[2018-07-23 21:00:17,347] {base_task_runner.py:98} INFO - Subtask: args.func(args)
[2018-07-23 21:00:17,347] {base_task_runner.py:98} INFO - Subtask: File "/usr/lib/hello-processing/airflow/bin/cli.py", line 392, in run
[2018-07-23 21:00:17,347] {base_task_runner.py:98} INFO - Subtask: pool=args.pool,
[2018-07-23 21:00:17,348] {base_task_runner.py:98} INFO - Subtask: File "/usr/lib/hello-processing/airflow/utils/db.py", line 50, in wrapper
[2018-07-23 21:00:17,348] {base_task_runner.py:98} INFO - Subtask: result = func(*args, **kwargs)
[2018-07-23 21:00:17,348] {base_task_runner.py:98} INFO - Subtask: File "/usr/lib/hello-processing/airflow/models.py", line 1493, in _run_raw_task
[2018-07-23 21:00:17,348] {base_task_runner.py:98} INFO - Subtask: result = task_copy.execute(context=context)
[2018-07-23 21:00:17,348] {base_task_runner.py:98} INFO - Subtask: File "/usr/lib/hello-processing/airflow/operators/bash_operator.py", line 109, in execute
[2018-07-23 21:00:17,348] {base_task_runner.py:98} INFO - Subtask: raise AirflowException("Bash command failed")
[2018-07-23 21:00:17,349] {base_task_runner.py:98} INFO - Subtask: airflow.exceptions.AirflowException: Bash command failed
Here is the setup.py of the package that I want to upload to Dataflow:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
import setuptools
import subprocess

from distutils.command.build import build as _build
from setuptools import find_packages, setup

logger = logging.getLogger("setup")
logger.setLevel(logging.INFO)


# Build
class build(_build):
    sub_commands = _build.sub_commands + [('CustomCommands', None)]


# Custom commands
CUSTOM_COMMANDS = [(["sudo", "apt-get", "update"], "."),
                   (["sudo", "apt-get", "install", "git", "build-essential", "libatlas-base-dev", "-y"], "."),
                   (["git", "clone", "--recursive", "https://github.com/dmlc/xgboost"], "."),
                   (["sudo", "make"], "xgboost"),
                   (["sudo", "python", "setup.py", "install"], "xgboost/python-package"),
                   (["sudo", "pip", "install", "xgboost"], ".")]


# Custom commands
class CustomCommands(setuptools.Command):
    # Initialize options
    def initialize_options(self):
        pass

    # Finalize options
    def finalize_options(self):
        pass

    # Run custom command
    def RunCustomCommand(self, command_list):
        logger.info('Running command: %s' % command_list[0])
        p = subprocess.Popen(command_list[0], stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                             stderr=subprocess.STDOUT, cwd=command_list[1])
        stdout_data, _ = p.communicate()
        logger.info('Command output: %s' % stdout_data)
        if p.returncode != 0:
            raise RuntimeError('Command %s failed: exit code: %s' % (command_list[0], p.returncode))

    # Run
    def run(self):
        for command in CUSTOM_COMMANDS:
            self.RunCustomCommand(command)


# Get required packages
def get_required_packages():
    return [
        'google-cloud-storage==1.3.2',
        'numpy==1.13.0',
        'pandas==0.23.0',
        'scikit-learn==0.19.1',
        # 'xgboost==0.71'. Install XGBoost from source. Install from pip doesn't work as shared library is not available
        # directly on the workers (cf https://github.com/orfeon/dataflow-sample/blob/master/python/xgboost/setup.py)
    ]


# Setup
if __name__ == '__main__':
    setup(
        name='model-predict-sklearn',
        version='1.0.0',
        maintainer="Team Data Engineers",
        maintainer_email="data.engineers@hello.fr",
        packages=find_packages(),
        install_requires=get_required_packages(),
        cmdclass={
            'build': build,
            'CustomCommands': CustomCommands,
        }
    )
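As far as I understand, the "running sdist" / "Creating tar archive" lines in the log above come from Dataflow staging this setup.py, and that step can be reproduced on its own with something like the snippet below (just a sanity check on my side, not part of the pipeline; the dist dir is an arbitrary example):

# Build an sdist from the setup.py above, the same way the Dataflow runner's staging step does.
import subprocess
import sys

subprocess.check_call(
    [sys.executable, 'setup.py', 'sdist', '--dist-dir', '/tmp/model-predict-dist'])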
And here is predictor.py, the Python script executed by the Airflow BashOperator. The Apache Beam pipeline is also defined in this file:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import apache_beam as beam
import json
import os
import shutil

from apache_beam.io.gcp.bigquery import BigQueryDisposition
from lib.gcp import BigQuery, GCS
from lib.transforms import BatchDoFn, PredictDoFn, UnBatchDoFn
from lib.utils import Utils

# Import libraries needed in the transforms. They are given to the workers with the option save_main_session = True
import pandas as pd
import numpy as np


# Main
def main():
    conf = Utils.load_configuration()
    args = Utils.get_arguments(conf)
    options = args.__dict__

    # Copy to tmp
    base_directory = os.path.dirname(__file__).split(os.path.sep)[-1]
    dest_directory = os.path.join('/', 'tmp', base_directory)
    setup_file = os.path.join(dest_directory, 'setup.py')

    # Clean Tmp
    shutil.rmtree(dest_directory)
    shutil.copytree(os.path.dirname(__file__), dest_directory)
    shutil.copyfile(os.path.join(dest_directory, 'setup', '{}.py'.format(args.library)),
                    os.path.join(dest_directory, 'setup.py'))

    if 'DataflowRunner' == args.runner:
        # We pass all args in the pipeline options, only the relevant ones will be used!
        options['save_main_session'] = True
        options['setup_file'] = setup_file

    # Options
    pipeline_options = beam.pipeline.PipelineOptions(flags=[], **options)

    model_file = GCS.get_object_as_file(options['project'], options['model_path'])
    description_file = GCS.get_object_as_file(options['project'], options['description_path'])
    ids = options['ids'].split(',')
    description = json.load(description_file)

    # Map of column name to column type to be able to cast them after reading
    fields = {field["name"]: Utils.bigquery_to_pandas_type(field["type"]) for field in description}
    table_schema = BigQuery.make_table_schema(ids)

    # https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L83
    is_partitioned_output = '$' in options['output_table']

    with beam.Pipeline(args.runner, pipeline_options) as pipeline:
        outputs = (
            pipeline
            | 'ReadFromFile' >> beam.io.ReadFromText(options['input_filebase'])
            | 'DecodeLine' >> beam.Map(Utils.decode_input(ids))
            | 'Batch' >> beam.ParDo(BatchDoFn(options['batch_size']))
            | 'Predict' >> beam.ParDo(PredictDoFn(model_file, fields))
            | 'Unbatch' >> beam.ParDo(UnBatchDoFn())
            | 'FormatOutput' >> beam.Map(Utils.format_output)
        )

        outputs | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
            options['output_table'],
            schema=table_schema,
            create_disposition=BigQueryDisposition.CREATE_NEVER if is_partitioned_output else BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=BigQueryDisposition.WRITE_APPEND if is_partitioned_output else BigQueryDisposition.WRITE_TRUNCATE
        )

    # Clean Tmp
    shutil.rmtree(dest_directory)


# Script
if __name__ == '__main__':
    main()
So here is my question: how can I do the packaging in my production environment, where pip is not available?
If you look at the Airflow BashOperator log, in particular this part:
[2018-07-23 21:00:17,186] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,178] {bash_operator.py:101} INFO - subprocess.CalledProcessError: Command '['/usr/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmpB3lvC3', 'google-cloud-dataflow==2.3.0', '--no-binary', ':all:', '--no-deps']' returned non-zero exit status 1
Maybe I could use setuptools instead?
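For example, could I build or download the google-cloud-dataflow sdist once on a machine that does have pip, ship it alongside the rest of the code, and point the runner at it? Something along these lines (a rough, untested sketch on my side; the tarball path is made up):

import apache_beam as beam

# Hypothetical sketch: pass a pre-staged Beam SDK tarball via sdk_location so that the
# DataflowRunner does not shell out to pip to fetch google-cloud-dataflow from PyPI.
pipeline_options = beam.pipeline.PipelineOptions(
    flags=[],
    runner='DataflowRunner',
    project='hello-analytics',
    temp_location='gs://1e42-analytics_data/learning/airflow_settings/tmp',
    staging_location='gs://1e42-analytics_data/learning/airflow_settings/staging',
    sdk_location='/usr/lib/hello-processing/dist/google-cloud-dataflow-2.3.0.tar.gz',  # made-up local path
)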
For reference, this is what I get in my development environment:
>/usr/bin/python -m pip --version
pip 18.0 from /usr/lib/hello-processing/lib/python2.7/site-packages/pip (python 2.7)
But in my production environment, I get:
>/usr/bin/python -m pip --version
/usr/bin/python: No module named pip