Google Dataflow with Python fails to install workflow: exit status 1

Time: 2019-04-17 13:53:38

Tags: python-2.7 google-cloud-dataflow

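My Google Dataflow job runs locally with the local runner, but it fails to build its package when I run the pipeline with the DataflowRunner. I hit this problem on apache-beam[gcp]==2.6.0; the same pipeline works on apache-beam[gcp]==2.4.0.

My code works fine locally with the DirectRunner. Building the package with python setup.py sdist --formats=tar and installing it with pip install dist/my-package.tar also works.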

The job fails with the error message:

Failed to install packages: failed to install workflow: exit status 1

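This error is thrown after the following info log, which seems to indicate that the system numpy in the Dataflow container is missing its METADATA file: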

Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/usr/local/lib/python2.7/dist-packages/numpy-1.14.5.dist-info/METADATA'
Failed to report setup error to service: could not lease work item to report failure (no work items returned)

Based on the numpy error above, I installed numpy 1.14.5 to address it. I still cannot debug the package setup, because the exact way Dataflow builds its container is very opaque.
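The missing METADATA file from the log can be checked for in any environment where a shell is available (a local virtualenv, or a container started from the image). The following is a minimal sketch, not part of the original question; the helper name dist_info_missing_metadata is hypothetical, and the default path is simply the one quoted in the log line.

```python
import os


def dist_info_missing_metadata(site_dir):
    """Return the *.dist-info directories under site_dir that lack a METADATA file."""
    missing = []
    for entry in sorted(os.listdir(site_dir)):
        path = os.path.join(site_dir, entry)
        if entry.endswith(".dist-info") and os.path.isdir(path):
            if not os.path.isfile(os.path.join(path, "METADATA")):
                missing.append(entry)
    return missing


if __name__ == "__main__":
    # Path copied from the log line; adjust it for the environment being inspected.
    site_dir = "/usr/local/lib/python2.7/dist-packages"
    if os.path.isdir(site_dir):
        print(dist_info_missing_metadata(site_dir))
```

A non-empty result would point at distributions that pip cannot cleanly uninstall or upgrade, which is consistent with the EnvironmentError shown in the log.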

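The problem is not in my setup.py; otherwise the sdist build would not work. Dataflow's Docker image build process does not match dataflow.gcr.io/v1beta3/python:2.6.0, because neither numpy nor beam is installed in that image. Without a reproducible Docker build, debugging the workflow is difficult.

Some context on my workflow setup code:

I use a custom command to install the neuralcoref library from https://github.com/huggingface/neuralcoref-models/releases/download/en_coref_lg-3.0.0/en_coref_lg-3.0.0.tar.gz; the rest of setup.py is: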

...
REQUIRED_PACKAGES = [
    'six==1.12.0',
    'dill==0.2.9',
    'apache-beam[gcp]==2.6.0',
    'spacy==2.0.13',
    'requests==2.18.4',
    'unidecode==1.0.22',
    'tqdm==4.23.3',
    'lxml==4.2.1',
    'python-dateutil==2.7.3',
    'textblob==0.15.1',
    'networkx==2.1',
    'flashtext==2.7',
    'annoy==1.12.0',
    'ujson==1.35',
    'repoze.lru==0.7',
    'Whoosh==2.7.4',
    'python-Levenshtein==0.12.0',
    'fuzzywuzzy==0.16.0',
    'attrs==19.1.0',
    # 'scikit-learn==0.19.1',# preinstalled in dataflow
    # 'pandas==0.23.0',# preinstalled in dataflow
    # 'scipy==1.1.0',# preinstalled in dataflow

]

setuptools.setup(
    name='myproject',
    version='0.0.6',
    description='my project',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    cmdclass={
        # Command class instantiated and run during pip install scenarios.
        'build': build,
        'CustomCommands': CustomCommands,
    }
)

My local requirements.txt is:

six==1.12.0
apache-beam[gcp]==2.6.0
spacy==2.0.13
requests==2.18.4
unidecode==1.0.22
tqdm==4.23.3
lxml==4.2.1
python-dateutil==2.7.3
textblob==0.15.1
networkx==2.1
flashtext==2.7
annoy==1.12.0
ujson==1.35
repoze.lru==0.7
Whoosh==2.7.4
python-Levenshtein==0.12.0
fuzzywuzzy==0.16.0
attrs==19.1.0
scikit-learn==0.19.1
pandas==0.23.0
scipy==1.1.0
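Because the worker installs from setup.py while the local environment is built from requirements.txt, it can help to diff the two sets of pins. The following is a minimal sketch, not from the original post; parse_pins and diff_pins are hypothetical helpers, and the sample lists are abbreviated from the ones above.

```python
def parse_pins(lines):
    """Map package name (lowercased) to pinned version for 'name==version' lines."""
    pins = {}
    for line in lines:
        # Drop trailing comments and skip anything that is not an exact pin.
        line = line.split("#", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def diff_pins(setup_pins, requirements_pins):
    """Return (only_in_setup, only_in_requirements, version_mismatches)."""
    only_setup = sorted(set(setup_pins) - set(requirements_pins))
    only_reqs = sorted(set(requirements_pins) - set(setup_pins))
    mismatched = sorted(
        name for name in set(setup_pins) & set(requirements_pins)
        if setup_pins[name] != requirements_pins[name]
    )
    return only_setup, only_reqs, mismatched


setup_pins = parse_pins(["six==1.12.0", "dill==0.2.9", "apache-beam[gcp]==2.6.0"])
req_pins = parse_pins(["six==1.12.0", "apache-beam[gcp]==2.6.0", "pandas==0.23.0"])
print(diff_pins(setup_pins, req_pins))
```

For the full lists in the question, dill is pinned only in setup.py, while scikit-learn, pandas, and scipy appear only in requirements.txt (they are commented out in setup.py as preinstalled).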

The full error message is:

{
 insertId:  "7107501484934866351:1025729:0:380041"  
 jsonPayload: {
  line:  "boot.go:145"   
  message:  "Failed to install packages: failed to install workflow: exit status 1"   
 }
 labels: {
  compute.googleapis.com/resource_id:  "7107501484934866351"   
  compute.googleapis.com/resource_name:  "myjob-04170525-av5b-harness-0w5w"   
  compute.googleapis.com/resource_type:  "instance"   
  dataflow.googleapis.com/job_id:  "2019-04-17_05_25_10-4738638106522967260"   
  dataflow.googleapis.com/job_name:  "myjob"   
  dataflow.googleapis.com/region:  "us-central1"   
 }
 logName:  "projects/myproject/logs/dataflow.googleapis.com%2Fworker-startup"  
 receiveTimestamp:  "2019-04-17T13:21:37.786576023Z"  
 resource: {
  labels: {
   job_id:  "2019-04-17_05_25_10-4738638106522967260"    
   job_name:  "myjob"    
   project_id:  "myproject"    
   region:  "us-central1"    
   step_id:  ""    
  }
  type:  "dataflow_step"   
 }
 severity:  "CRITICAL"  
 timestamp:  "2019-04-17T13:21:19.954714Z"  
}

1 answer:

Answer 0 (score: 1)

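Are you pinning the Beam version in setup.py? I don't think that will work. The Dataflow version must match the version you launch the job from.

Each Beam version has its own Dataflow container. The container for 2.6.0 is available here: dataflow.gcr.io/v1beta3/python:2.6.0. There are significant differences between 2.4.0 and 2.6.0, including the pip version.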
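One reading of this suggestion is to leave apache-beam out of install_requires so that the worker keeps the SDK version the job was launched with. The sketch below illustrates that reading only; it is not a confirmed fix, the helper without_beam is hypothetical, and the list is abbreviated from the question's setup.py.

```python
def without_beam(requirements):
    """Return the requirements with any apache-beam pin removed (extras such as [gcp] included)."""
    kept = []
    for req in requirements:
        # Strip the version pin and any extras to get the bare distribution name.
        name = req.split("==", 1)[0].split("[", 1)[0].strip().lower()
        if name != "apache-beam":
            kept.append(req)
    return kept


REQUIRED_PACKAGES = without_beam([
    'six==1.12.0',
    'apache-beam[gcp]==2.6.0',
    'spacy==2.0.13',
])
print(REQUIRED_PACKAGES)
```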

To help debug further, please add a copy of your setup.py. It would also be useful to know which version of apache-beam is installed.
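One way to report the installed apache-beam version from Python is sketched below. This is an illustration rather than the command the answer had in mind; installed_version is a hypothetical helper that uses importlib.metadata where available and falls back to pkg_resources (from setuptools) on older interpreters such as Python 2.7.

```python
def installed_version(name):
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        # Available in the standard library from Python 3.8.
        from importlib import metadata
    except ImportError:
        metadata = None
    if metadata is not None:
        try:
            return metadata.version(name)
        except metadata.PackageNotFoundError:
            return None
    # Fallback for older interpreters; requires setuptools.
    import pkg_resources
    try:
        return pkg_resources.get_distribution(name).version
    except pkg_resources.DistributionNotFound:
        return None


print(installed_version("apache-beam"))
```

Running this in the environment that launches the job shows the SDK version that the Dataflow container needs to match.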