I'm just getting started with Beam/Dataflow. I read the documentation, ran the examples, and am now making small edits to the example scripts to see whether I can apply what I've read. I made one small modification to the "minimal wordcount" script, shown below. (I've removed my GCP info.)
When I run it with DirectRunner
it works fine and uploads the results to my GCP Cloud Storage bucket. However, when I switch to DataflowRunner
I get the following error: ImportError: No module named IPython.core
and the job fails. (The full error message is pasted below the code.)
I understand this means a module is missing, but I don't know how or where to import it. Or perhaps I'm misunderstanding PTransform
entirely. Any guidance is much appreciated. FYI, I'm using Python 2.7 (Anaconda).
Input file: https://github.com/ageron/handson-ml/blob/master/datasets/housing/housing.csv
Code
from __future__ import absolute_import

import argparse
import logging

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


def run(argv=None):
  """Main entry point; defines and runs the housing pipeline."""
  parser = argparse.ArgumentParser()
  parser.add_argument('--input',
                      dest='input',
                      default='gs://<bucket>/housing.csv',
                      help='Input file to process.')
  parser.add_argument('--output',
                      dest='output',
                      default='gs://<bucket>/housing_file',
                      help='Output file to write results to.')
  known_args, pipeline_args = parser.parse_known_args(argv)
  pipeline_args.extend([
      # '--runner=DirectRunner',  # local
      '--runner=DataflowRunner',  # GCP
      '--project=<project>',
      '--staging_location=gs://<bucket>/staging',
      '--temp_location=gs://<bucket>/temp',
      '--job_name=housing-data',
  ])
  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = True

  with beam.Pipeline(options=pipeline_options) as p:

    # Read the text file[pattern] into a PCollection.
    lines = p | ReadFromText(known_args.input)

    # Simple transform: gets the X values from the .csv file
    class get_X_values(beam.DoFn):
      def process(self, element):
        value_list = element.split(',')[0:8]
        value_list.append(element.split(',')[9])
        return [",".join(value_list)]

    X_values = lines | beam.ParDo(get_X_values())

    # Write the output using a "Write" transform that has side effects.
    # pylint: disable=expression-not-assigned
    X_values | WriteToText(known_args.output, file_name_suffix='.txt')


if __name__ == '__main__':
  logging.getLogger().setLevel(logging.INFO)
  run()
Error message
JOB_MESSAGE_ERROR: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 733, in run
    self._load_main_session(self.local_staging_directory)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 472, in _load_main_session
    pickler.load_session(session_file)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 247, in load_session
    return dill.load_session(file_path)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 363, in load_session
    module = unpickler.load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 767, in _import_module
    return getattr(__import__(module, None, None, [obj]), obj)
ImportError: No module named IPython.core
Answer 0 (score: 0)
I got my code to work by running the file from a command window, outside of my IDE (Spyder). I suppose that counts as a solution, but I still don't know why the error occurred in the first place.
FWIW, I still get the same ImportError when I run the code from inside the IDE, even after adding dependencies for the workers. If anyone knows why this is a problem, or what I can do to fix it, I'm still interested...
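A likely explanation for the behavior above: save_main_session = True pickles the state of the __main__ module and ships it to the Dataflow workers. When the script is launched from Spyder, that session includes IPython objects, so the remote worker tries (and fails) to import IPython.core; from a plain command window there is no IPython state to pickle. A minimal sketch of one common workaround, hoisting the per-row logic to module level so it is pickled by reference instead of via the main session (the function name and the beam.Map wiring are illustrative, and the column layout is assumed to match the DoFn in the question):

```python
# Module-level row transform (assumption: a 10-column CSV where column 8
# is the target column to drop, as in the question's get_X_values DoFn).
def extract_x_values(element):
    """Keep columns 0-7 plus column 9 of a comma-separated row."""
    fields = element.split(',')
    return ','.join(fields[0:8] + [fields[9]])

# In the pipeline body, the nested DoFn could then be replaced with:
#   X_values = lines | beam.Map(extract_x_values)
# which may let you drop save_main_session = True entirely.
```

With no closures over IDE state in __main__, the worker only needs to re-import the module, not reconstruct the interactive session.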