I'm just getting started with Beam/Dataflow. I read the documentation, ran the examples, and am now making small edits to the example scripts to see whether I can apply what I've read. I made one small modification to the "minimal wordcount" script, shown below. (I've removed my GCP info.)
When I run it with DirectRunner
it works fine and uploads the results to my GCP Cloud Storage bucket. However, when I switch to DataflowRunner
I get the following error: ImportError: No module named IPython.core
and the job fails. (The full error message is pasted below the code.)
I understand this means a module is missing, but I don't know how or where to import it. Or perhaps I'm misunderstanding PTransform
entirely. Any guidance is much appreciated. FYI, I'm using Python 2.7 (Anaconda).
Input file: https://github.com/ageron/handson-ml/blob/master/datasets/housing/housing.csv
Code
from __future__ import absolute_import

import argparse
import logging

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


def run(argv=None):
  """Main entry point; defines and runs the housing pipeline."""
  parser = argparse.ArgumentParser()
  parser.add_argument('--input',
                      dest='input',
                      default='gs://<bucket>/housing.csv',
                      help='Input file to process.')
  parser.add_argument('--output',
                      dest='output',
                      default='gs://<bucket>/housing_file',
                      help='Output file to write results to.')
  known_args, pipeline_args = parser.parse_known_args(argv)
  pipeline_args.extend([
      # '--runner=DirectRunner',  # local
      '--runner=DataflowRunner',  # GCP
      '--project=<project>',
      '--staging_location=gs://<bucket>/staging',
      '--temp_location=gs://<bucket>/temp',
      '--job_name=housing-data',
  ])
  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = True

  with beam.Pipeline(options=pipeline_options) as p:

    # Read the text file[pattern] into a PCollection.
    lines = p | ReadFromText(known_args.input)

    # Simple transform: gets the X values from the .csv file
    class get_X_values(beam.DoFn):
      def process(self, element):
        value_list = element.split(',')[0:8]
        value_list.append(element.split(',')[9])
        return [",".join(value_list)]

    X_values = lines | beam.ParDo(get_X_values())

    # Write the output using a "Write" transform that has side effects.
    # pylint: disable=expression-not-assigned
    X_values | WriteToText(known_args.output, file_name_suffix='.txt')


if __name__ == '__main__':
  logging.getLogger().setLevel(logging.INFO)
  run()
Error message
JOB_MESSAGE_ERROR: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 733, in run
    self._load_main_session(self.local_staging_directory)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 472, in _load_main_session
    pickler.load_session(session_file)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 247, in load_session
    return dill.load_session(file_path)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 363, in load_session
    module = unpickler.load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 767, in _import_module
    return getattr(__import__(module, None, None, [obj]), obj)
ImportError: No module named IPython.core
Answer 0 (score: 0)
I got my code to work by running the file from a command window, outside of my IDE (Spyder). I suppose that counts as a solution, but I still don't know why the error occurred in the first place.
FWIW, I still get the same ImportError when I run the code from inside the IDE, even after adding dependencies for the workers. If anyone knows why this is a problem, or what I can do to fix it, I'm still interested...
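A likely explanation for the behavior above: save_main_session = True pickles the state of the __main__ module and ships it to the Dataflow workers. When the script is launched from Spyder, that session includes IPython objects, so the remote worker tries (and fails) to import IPython.core; from a plain command window there is no IPython state to pickle. A minimal sketch of one common workaround, hoisting the per-row logic to module level so it is pickled by reference instead of via the main session (the function name and the beam.Map wiring are illustrative, and the column layout is assumed to match the DoFn in the question):

```python
# Module-level row transform (assumption: a 10-column CSV where column 8
# is the target column to drop, as in the question's get_X_values DoFn).
def extract_x_values(element):
    """Keep columns 0-7 plus column 9 of a comma-separated row."""
    fields = element.split(',')
    return ','.join(fields[0:8] + [fields[9]])

# In the pipeline body, the nested DoFn could then be replaced with:
#   X_values = lines | beam.Map(extract_x_values)
# which may let you drop save_main_session = True entirely.
```

With no closures over IDE state in __main__, the worker only needs to re-import the module, not reconstruct the interactive session.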