Apache Beam Python: how can I package with a tool other than pip?

Date: 2018-07-24 13:24:40

Tags: python pip google-cloud-dataflow setuptools apache-beam

I have an Apache Beam pipeline that is triggered by Airflow. In my development environment, Airflow is able to build the package (using pip, I assume) and then upload it to Dataflow, where it then runs. In my production environment, however, it is a different story.

Here is the output of the Airflow operator responsible for launching the Dataflow job:


*** Log file isn't local.
*** Fetching here: http://work4.podb.hello.io:8793/log/learning.pack_operation/model_predict/2018-07-17T00:00:00/2.log

[2018-07-23 21:00:07,943] {cli.py:374} INFO - Running on host work4.podb.hello.io
[2018-07-23 21:00:07,987] {models.py:1197} INFO - Dependencies all met for <TaskInstance: learning.pack_operation.model_predict 2018-07-17 00:00:00 [queued]>
[2018-07-23 21:00:08,004] {models.py:1197} INFO - Dependencies all met for <TaskInstance: learning.pack_operation.model_predict 2018-07-17 00:00:00 [queued]>
[2018-07-23 21:00:08,005] {models.py:1407} INFO - 
--------------------------------------------------------------------------------
Starting attempt 2 of 2
--------------------------------------------------------------------------------

[2018-07-23 21:00:08,026] {models.py:1428} INFO - Executing <Task(BashOperator): model_predict> on 2018-07-17 00:00:00
[2018-07-23 21:00:08,026] {base_task_runner.py:115} INFO - Running: ['bash', '-c', u'airflow run learning.pack_operation model_predict 2018-07-17T00:00:00 --job_id 725805 --raw -sd DAGS_FOLDER/learning/pack_operation/dag.py']
[2018-07-23 21:00:09,185] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:09,184] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/Grammar.txt
[2018-07-23 21:00:09,205] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:09,205] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
[2018-07-23 21:00:09,944] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:09,943] {configuration.py:206} WARNING - section/key [celery/celery_ssl_active] not found in config
[2018-07-23 21:00:09,945] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:09,944] {default_celery.py:41} WARNING - Celery Executor will run without SSL
[2018-07-23 21:00:09,951] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:09,946] {__init__.py:45} INFO - Using executor CeleryExecutor
[2018-07-23 21:00:10,088] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:10,087] {models.py:189} INFO - Filling up the DagBag from /usr/lib/hello-processing/processing/dags/learning/pack_operation/dag.py
[2018-07-23 21:00:12,242] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:12,242] {bash_operator.py:70} INFO - Tmp dir root location: 
[2018-07-23 21:00:12,243] {base_task_runner.py:98} INFO - Subtask:  /tmp
[2018-07-23 21:00:12,245] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:12,244] {bash_operator.py:80} INFO - Temporary script location: /tmp/airflowtmpiThdaC//tmp/airflowtmpiThdaC/model_predictiogrlD
[2018-07-23 21:00:12,246] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:12,245] {bash_operator.py:88} INFO - Running command: python /usr/lib/hello-processing/data_learning_tools/inference/model_predict/sklearn_api/predictor.py                 --input-filebase=gs://1e42-analytics_data/learning/pack_operation/20180717_1_0_0/extracted-*.json                 --output-table=hello-analytics:learning.pack_operation_inferred_1_0_020180717                 --model-path=gs://1e42-analytics_data/learning/airflow_settings/model.pkl                 --description-path=gs://1e42-analytics_data/learning/airflow_settings/description.json                 --ids=user_id,bought_pack,targeted                 --batch-size=1000                 --runner=DataflowRunner                 --job-name=model-pack-operation-20180723-210012                 --region=europe-west1                 --project=hello-analytics                 --max-num-workers=10                 --temp-location=gs://1e42-analytics_data/learning/airflow_settings/tmp                 --staging-location=gs://1e42-analytics_data/learning/airflow_settings/staging                 --library=xgboost
[2018-07-23 21:00:12,246] {base_task_runner.py:98} INFO - Subtask:               
[2018-07-23 21:00:12,254] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:12,253] {bash_operator.py:97} INFO - Output:
[2018-07-23 21:00:15,144] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,143] {bash_operator.py:101} INFO - /usr/lib/hello-processing/oauth2client/contrib/gce.py:99: UserWarning: You have requested explicit scopes to be used with a GCE service account.
[2018-07-23 21:00:15,144] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,143] {bash_operator.py:101} INFO - Using this argument will have no effect on the actual scopes for tokens
[2018-07-23 21:00:15,145] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,143] {bash_operator.py:101} INFO - requested. These scopes are set at VM instance creation time and
[2018-07-23 21:00:15,145] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,143] {bash_operator.py:101} INFO - can't be overridden in the request.
[2018-07-23 21:00:15,145] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,143] {bash_operator.py:101} INFO - 
[2018-07-23 21:00:15,145] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,143] {bash_operator.py:101} INFO - warnings.warn(_SCOPES_WARNING)
[2018-07-23 21:00:15,147] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,146] {bash_operator.py:101} INFO - /usr/lib/hello-processing/apache_beam/io/gcp/gcsio.py:160: DeprecationWarning: object() takes no parameters
[2018-07-23 21:00:15,147] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,147] {bash_operator.py:101} INFO - super(GcsIO, cls).__new__(cls, storage_client))
[2018-07-23 21:00:15,416] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,415] {bash_operator.py:101} INFO - /usr/lib/hello-processing/apache_beam/coders/typecoders.py:133: UserWarning: Using fallback coder for typehint: Any.
[2018-07-23 21:00:15,417] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,416] {bash_operator.py:101} INFO - warnings.warn('Using fallback coder for typehint: %r.' % typehint)
[2018-07-23 21:00:15,448] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,447] {bash_operator.py:101} INFO - /usr/lib/hello-processing/apache_beam/coders/typecoders.py:133: UserWarning: Using fallback coder for typehint: Dict[Any, Any].
[2018-07-23 21:00:15,449] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:15,448] {bash_operator.py:101} INFO - warnings.warn('Using fallback coder for typehint: %r.' % typehint)
[2018-07-23 21:00:16,293] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,292] {bash_operator.py:101} INFO - running sdist
[2018-07-23 21:00:16,293] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,293] {bash_operator.py:101} INFO - running egg_info
[2018-07-23 21:00:16,303] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,303] {bash_operator.py:101} INFO - creating model_predict_sklearn.egg-info
[2018-07-23 21:00:16,305] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,305] {bash_operator.py:101} INFO - writing requirements to model_predict_sklearn.egg-info/requires.txt
[2018-07-23 21:00:16,306] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,305] {bash_operator.py:101} INFO - writing model_predict_sklearn.egg-info/PKG-INFO
[2018-07-23 21:00:16,307] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,307] {bash_operator.py:101} INFO - writing top-level names to model_predict_sklearn.egg-info/top_level.txt
[2018-07-23 21:00:16,308] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,307] {bash_operator.py:101} INFO - writing dependency_links to model_predict_sklearn.egg-info/dependency_links.txt
[2018-07-23 21:00:16,315] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,315] {bash_operator.py:101} INFO - writing manifest file 'model_predict_sklearn.egg-info/SOURCES.txt'
[2018-07-23 21:00:16,323] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,322] {bash_operator.py:101} INFO - reading manifest file 'model_predict_sklearn.egg-info/SOURCES.txt'
[2018-07-23 21:00:16,324] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,324] {bash_operator.py:101} INFO - writing manifest file 'model_predict_sklearn.egg-info/SOURCES.txt'
[2018-07-23 21:00:16,325] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,324] {bash_operator.py:101} INFO - warning: sdist: standard file not found: should have one of README, README.rst, README.txt
[2018-07-23 21:00:16,325] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,324] {bash_operator.py:101} INFO - 
[2018-07-23 21:00:16,325] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,324] {bash_operator.py:101} INFO - running check
[2018-07-23 21:00:16,419] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,417] {bash_operator.py:101} INFO - warning: check: missing required meta-data: url
[2018-07-23 21:00:16,420] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,417] {bash_operator.py:101} INFO - 
[2018-07-23 21:00:16,420] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,417] {bash_operator.py:101} INFO - creating model-predict-sklearn-1.0.0
[2018-07-23 21:00:16,420] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,417] {bash_operator.py:101} INFO - creating model-predict-sklearn-1.0.0/lib
[2018-07-23 21:00:16,420] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,417] {bash_operator.py:101} INFO - creating model-predict-sklearn-1.0.0/model_predict_sklearn.egg-info
[2018-07-23 21:00:16,420] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,417] {bash_operator.py:101} INFO - copying files to model-predict-sklearn-1.0.0...
[2018-07-23 21:00:16,421] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,417] {bash_operator.py:101} INFO - copying setup.py -> model-predict-sklearn-1.0.0
[2018-07-23 21:00:16,421] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,418] {bash_operator.py:101} INFO - copying lib/__init__.py -> model-predict-sklearn-1.0.0/lib
[2018-07-23 21:00:16,421] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,418] {bash_operator.py:101} INFO - copying lib/gcp.py -> model-predict-sklearn-1.0.0/lib
[2018-07-23 21:00:16,421] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,418] {bash_operator.py:101} INFO - copying lib/models.py -> model-predict-sklearn-1.0.0/lib
[2018-07-23 21:00:16,422] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,418] {bash_operator.py:101} INFO - copying lib/transforms.py -> model-predict-sklearn-1.0.0/lib
[2018-07-23 21:00:16,422] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,418] {bash_operator.py:101} INFO - copying lib/utils.py -> model-predict-sklearn-1.0.0/lib
[2018-07-23 21:00:16,422] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,418] {bash_operator.py:101} INFO - copying model_predict_sklearn.egg-info/PKG-INFO -> model-predict-sklearn-1.0.0/model_predict_sklearn.egg-info
[2018-07-23 21:00:16,422] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,418] {bash_operator.py:101} INFO - copying model_predict_sklearn.egg-info/SOURCES.txt -> model-predict-sklearn-1.0.0/model_predict_sklearn.egg-info
[2018-07-23 21:00:16,423] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,418] {bash_operator.py:101} INFO - copying model_predict_sklearn.egg-info/dependency_links.txt -> model-predict-sklearn-1.0.0/model_predict_sklearn.egg-info
[2018-07-23 21:00:16,423] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,419] {bash_operator.py:101} INFO - copying model_predict_sklearn.egg-info/requires.txt -> model-predict-sklearn-1.0.0/model_predict_sklearn.egg-info
[2018-07-23 21:00:16,427] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,419] {bash_operator.py:101} INFO - copying model_predict_sklearn.egg-info/top_level.txt -> model-predict-sklearn-1.0.0/model_predict_sklearn.egg-info
[2018-07-23 21:00:16,429] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,419] {bash_operator.py:101} INFO - Writing model-predict-sklearn-1.0.0/setup.cfg
[2018-07-23 21:00:16,429] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,426] {bash_operator.py:101} INFO - Creating tar archive
[2018-07-23 21:00:16,430] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:16,427] {bash_operator.py:101} INFO - removing 'model-predict-sklearn-1.0.0' (and everything under it)
[2018-07-23 21:00:17,172] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,172] {bash_operator.py:101} INFO - /usr/bin/python: No module named pip
[2018-07-23 21:00:17,178] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,175] {bash_operator.py:101} INFO - Traceback (most recent call last):
[2018-07-23 21:00:17,179] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,175] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/data_learning_tools/inference/model_predict/sklearn_api/predictor.py", line 88, in <module>
[2018-07-23 21:00:17,179] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,175] {bash_operator.py:101} INFO - main()
[2018-07-23 21:00:17,179] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,175] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/data_learning_tools/inference/model_predict/sklearn_api/predictor.py", line 79, in main
[2018-07-23 21:00:17,179] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,175] {bash_operator.py:101} INFO - write_disposition=BigQueryDisposition.WRITE_APPEND if is_partitioned_output else BigQueryDisposition.WRITE_TRUNCATE
[2018-07-23 21:00:17,180] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,176] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/pipeline.py", line 349, in __exit__
[2018-07-23 21:00:17,180] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,176] {bash_operator.py:101} INFO - self.run().wait_until_finish()
[2018-07-23 21:00:17,180] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,176] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/pipeline.py", line 342, in run
[2018-07-23 21:00:17,180] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,176] {bash_operator.py:101} INFO - return self.runner.run_pipeline(self)
[2018-07-23 21:00:17,181] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,176] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/runners/dataflow/dataflow_runner.py", line 315, in run_pipeline
[2018-07-23 21:00:17,181] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,176] {bash_operator.py:101} INFO - self.dataflow_client.create_job(self.job), self)
[2018-07-23 21:00:17,181] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,176] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/utils/retry.py", line 175, in wrapper
[2018-07-23 21:00:17,181] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,176] {bash_operator.py:101} INFO - return fun(*args, **kwargs)
[2018-07-23 21:00:17,181] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,176] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/runners/dataflow/internal/apiclient.py", line 461, in create_job
[2018-07-23 21:00:17,182] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,177] {bash_operator.py:101} INFO - self.create_job_description(job)
[2018-07-23 21:00:17,182] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,177] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/runners/dataflow/internal/apiclient.py", line 491, in create_job_description
[2018-07-23 21:00:17,182] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,177] {bash_operator.py:101} INFO - job.options, file_copy=self._gcs_file_copy)
[2018-07-23 21:00:17,182] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,177] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/runners/dataflow/internal/dependency.py", line 400, in stage_job_resources
[2018-07-23 21:00:17,183] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,177] {bash_operator.py:101} INFO - _stage_beam_sdk_tarball(sdk_remote_location, staged_path, temp_dir)
[2018-07-23 21:00:17,183] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,177] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/runners/dataflow/internal/dependency.py", line 485, in _stage_beam_sdk_tarball
[2018-07-23 21:00:17,183] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,177] {bash_operator.py:101} INFO - _dependency_file_copy(_download_pypi_sdk_package(temp_dir), staged_path)
[2018-07-23 21:00:17,183] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,177] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/runners/dataflow/internal/dependency.py", line 584, in _download_pypi_sdk_package
[2018-07-23 21:00:17,184] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,177] {bash_operator.py:101} INFO - processes.check_call(cmd_args)
[2018-07-23 21:00:17,184] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,178] {bash_operator.py:101} INFO - File "/usr/lib/hello-processing/apache_beam/utils/processes.py", line 44, in check_call
[2018-07-23 21:00:17,185] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,178] {bash_operator.py:101} INFO - return subprocess.check_call(*args, **kwargs)
[2018-07-23 21:00:17,186] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,178] {bash_operator.py:101} INFO - File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
[2018-07-23 21:00:17,186] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,178] {bash_operator.py:101} INFO - raise CalledProcessError(retcode, cmd)
[2018-07-23 21:00:17,186] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,178] {bash_operator.py:101} INFO - subprocess.CalledProcessError: Command '['/usr/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmpB3lvC3', 'google-cloud-dataflow==2.3.0', '--no-binary', ':all:', '--no-deps']' returned non-zero exit status 1
[2018-07-23 21:00:17,308] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,306] {bash_operator.py:105} INFO - Command exited with return code 1
[2018-07-23 21:00:17,346] {base_task_runner.py:98} INFO - Subtask: Traceback (most recent call last):
[2018-07-23 21:00:17,347] {base_task_runner.py:98} INFO - Subtask:   File "/usr/lib/hello-processing/airflow/bin/airflow", line 27, in <module>
[2018-07-23 21:00:17,347] {base_task_runner.py:98} INFO - Subtask:     args.func(args)
[2018-07-23 21:00:17,347] {base_task_runner.py:98} INFO - Subtask:   File "/usr/lib/hello-processing/airflow/bin/cli.py", line 392, in run
[2018-07-23 21:00:17,347] {base_task_runner.py:98} INFO - Subtask:     pool=args.pool,
[2018-07-23 21:00:17,348] {base_task_runner.py:98} INFO - Subtask:   File "/usr/lib/hello-processing/airflow/utils/db.py", line 50, in wrapper
[2018-07-23 21:00:17,348] {base_task_runner.py:98} INFO - Subtask:     result = func(*args, **kwargs)
[2018-07-23 21:00:17,348] {base_task_runner.py:98} INFO - Subtask:   File "/usr/lib/hello-processing/airflow/models.py", line 1493, in _run_raw_task
[2018-07-23 21:00:17,348] {base_task_runner.py:98} INFO - Subtask:     result = task_copy.execute(context=context)
[2018-07-23 21:00:17,348] {base_task_runner.py:98} INFO - Subtask:   File "/usr/lib/hello-processing/airflow/operators/bash_operator.py", line 109, in execute
[2018-07-23 21:00:17,348] {base_task_runner.py:98} INFO - Subtask:     raise AirflowException("Bash command failed")
[2018-07-23 21:00:17,349] {base_task_runner.py:98} INFO - Subtask: airflow.exceptions.AirflowException: Bash command failed

Here is the setup.py of the package I want to upload to Dataflow:


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
import setuptools
import subprocess
from distutils.command.build import build as _build
from setuptools import find_packages, setup

logger = logging.getLogger("setup")
logger.setLevel(logging.INFO)


# Build
class build(_build):
    sub_commands = _build.sub_commands + [('CustomCommands', None)]


# Custom commands
CUSTOM_COMMANDS = [(["sudo", "apt-get", "update"], "."),
                   (["sudo", "apt-get", "install", "git", "build-essential", "libatlas-base-dev", "-y"], "."),
                   (["git", "clone", "--recursive", "https://github.com/dmlc/xgboost"], "."),
                   (["sudo", "make"], "xgboost"),
                   (["sudo", "python", "setup.py", "install"], "xgboost/python-package"),
                   (["sudo", "pip", "install", "xgboost"], ".")]


# Custom commands
class CustomCommands(setuptools.Command):

    # Initialize options
    def initialize_options(self):
        pass

    # Finalize options
    def finalize_options(self):
        pass

    # Run custom command
    def RunCustomCommand(self, command_list):
        logger.info('Running command: %s' % command_list[0])
        p = subprocess.Popen(command_list[0], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                             cwd=command_list[1])
        stdout_data, _ = p.communicate()
        logger.info('Command output: %s' % stdout_data)

        if p.returncode != 0:
            raise RuntimeError('Command %s failed: exit code: %s' % (command_list[0], p.returncode))

    # Run
    def run(self):
        for command in CUSTOM_COMMANDS:
            self.RunCustomCommand(command)


# Get required packages
def get_required_packages():
    return [
        'google-cloud-storage==1.3.2',
        'numpy==1.13.0',
        'pandas==0.23.0',
        'scikit-learn==0.19.1',
        # 'xgboost==0.71'. Install XGBoost from source. Install from pip doesn't work as shared library is not available
        # directly on the workers (cf https://github.com/orfeon/dataflow-sample/blob/master/python/xgboost/setup.py)
    ]


# Setup
if __name__ == '__main__':
    setup(
        name='model-predict-sklearn',
        version='1.0.0',
        maintainer="Team Data Engineers",
        maintainer_email="data.engineers@hello.fr",
        packages=find_packages(),
        install_requires=get_required_packages(),
        cmdclass={
            'build': build,
            'CustomCommands': CustomCommands,
        }
    )
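
For what it is worth, the packaging step that does succeed in the log above ("running sdist" through "Creating tar archive") is plain setuptools: Dataflow builds a source distribution of this package from the setup_file option, without pip being involved at that point. A minimal local check, just a sketch that assumes it is run from the directory containing this setup.py (the /tmp/dist output directory is arbitrary):

# Sketch: build the same sdist that Dataflow builds from --setup_file, without pip.
import subprocess
import sys

# Assumes the current working directory contains the setup.py shown above.
subprocess.check_call([sys.executable, 'setup.py', 'sdist', '--dist-dir', '/tmp/dist'])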

predict.py is the Python script executed by the Airflow BashOperator; the Apache Beam pipeline is also defined in this file:


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import apache_beam as beam
import json
import os
import shutil
from apache_beam.io.gcp.bigquery import BigQueryDisposition

from lib.gcp import BigQuery, GCS
from lib.transforms import BatchDoFn, PredictDoFn, UnBatchDoFn
from lib.utils import Utils


# Import libraries needed in the transforms. They are given to the workers with the option save_main_session = True
import pandas as pd
import numpy as np


# Main
def main():
    conf = Utils.load_configuration()
    args = Utils.get_arguments(conf)

    options = args.__dict__



    # Copy to tmp
    base_directory = os.path.dirname(__file__).split(os.path.sep)[-1]
    dest_directory = os.path.join('/', 'tmp', base_directory)
    setup_file = os.path.join(dest_directory, 'setup.py')

    # Clean Tmp
    shutil.rmtree(dest_directory)

    shutil.copytree(os.path.dirname(__file__), dest_directory)
    shutil.copyfile(os.path.join(dest_directory, 'setup', '{}.py'.format(args.library)),
                    os.path.join(dest_directory, 'setup.py'))

    if 'DataflowRunner' == args.runner:
        # We pass all args in the pipeline options, only the relevant ones will be used!
        options['save_main_session'] = True
        options['setup_file'] = setup_file

    # Options
    pipeline_options = beam.pipeline.PipelineOptions(flags=[], **options)

    model_file = GCS.get_object_as_file(options['project'], options['model_path'])
    description_file = GCS.get_object_as_file(options['project'], options['description_path'])

    ids = options['ids'].split(',')

    description = json.load(description_file)

    # Map of column name to column type to be able to cast them after reading
    fields = {field["name"]: Utils.bigquery_to_pandas_type(field["type"]) for field in description}

    table_schema = BigQuery.make_table_schema(ids)

    # https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L83
    is_partitioned_output = '$' in options['output_table']

    with beam.Pipeline(args.runner, pipeline_options) as pipeline:
        outputs = (
                pipeline
                | 'ReadFromFile' >> beam.io.ReadFromText(options['input_filebase'])
                | 'DecodeLine' >> beam.Map(Utils.decode_input(ids))
                | 'Batch' >> beam.ParDo(BatchDoFn(options['batch_size']))
                | 'Predict' >> beam.ParDo(PredictDoFn(model_file, fields))
                | 'Unbatch' >> beam.ParDo(UnBatchDoFn())
                | 'FormatOutput' >> beam.Map(Utils.format_output)
        )

        outputs | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
            options['output_table'],
            schema=table_schema,
            create_disposition=BigQueryDisposition.CREATE_NEVER if is_partitioned_output else BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=BigQueryDisposition.WRITE_APPEND if is_partitioned_output else BigQueryDisposition.WRITE_TRUNCATE
        )

    # Clean Tmp
    shutil.rmtree(dest_directory)


# Script
if __name__ == '__main__':
    main()

So here is my question: how can I do the packaging without using pip in my production environment? If you look at the Airflow BashOperator log, in particular this part:

[2018-07-23 21:00:17,186] {base_task_runner.py:98} INFO - Subtask: [2018-07-23 21:00:17,178] {bash_operator.py:101} INFO - subprocess.CalledProcessError: Command '['/usr/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmpB3lvC3', 'google-cloud-dataflow==2.3.0', '--no-binary', ':all:', '--no-deps']' returned non-zero exit status 1

Could I maybe use setuptools instead?
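
From the traceback, my own package does seem to be built with setuptools already (the "running sdist" / "Creating tar archive" steps succeed); the pip call that fails afterwards is Beam downloading its own SDK package (google-cloud-dataflow==2.3.0) to stage it for the workers. One idea I have not verified, sketched below with a hypothetical tarball path, would be to pre-download that SDK tarball once (on a machine that does have pip) and point the pipeline at it via the standard sdk_location option, so the runner copies the file instead of calling pip download. In predict.py that would look roughly like:

if 'DataflowRunner' == args.runner:
    options['save_main_session'] = True
    options['setup_file'] = setup_file
    # Sketch, not verified: hypothetical path to a pre-downloaded
    # google-cloud-dataflow==2.3.0 source tarball (local path or gs:// URL).
    options['sdk_location'] = '/path/to/google-cloud-dataflow-2.3.0.tar.gz'

pipeline_options = beam.pipeline.PipelineOptions(flags=[], **options)

The other obvious route would be to make pip available for /usr/bin/python on the production host, but the point of this question is to avoid depending on pip there.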

Here is what I get in my development environment:

>/usr/bin/python -m pip --version
pip 18.0 from /usr/lib/hello-processing/lib/python2.7/site-packages/pip (python 2.7)

But in my production environment, I get:

>/usr/bin/python -m pip --version
/usr/bin/python: No module named pip
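
Since the DataflowRunner shells out to "<python> -m pip download ..." when staging the Beam SDK (that is the CalledProcessError above), this difference between the two hosts only shows up once the job is being submitted. A small fail-fast check I am considering, a sketch with a hypothetical helper that is not part of the existing code:

import subprocess

# Sketch (hypothetical helper): fail early if the interpreter that will submit the
# Dataflow job cannot import pip, instead of failing later inside SDK staging.
def has_pip(python_executable='/usr/bin/python'):
    return subprocess.call([python_executable, '-c', 'import pip']) == 0

if not has_pip():
    raise RuntimeError('pip is not available for /usr/bin/python: install pip '
                       'or point sdk_location at a pre-downloaded SDK tarball')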

0 Answers:

No answers yet.