Beam type hints and GroupByKey

Time: 2018-06-29 20:47:34

Tags: python apache-beam beam

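I'm running into a problem with GroupByKey that I believe comes down to a type issue. I've been looking at it for a while and have followed several stack traces, but it's still not clear to me why the following code raises an error: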

# Python 2.7, as in the traceback below
import apache_beam as beam


@beam.typehints.with_output_types(beam.typehints.Tuple[long, float])
class MultiMap(beam.DoFn):
    def process(self, element):
        items = element.split(',')
        print items
        r = (long(items[0]), float(items[10]))
        print r
        return r


pipeline = beam.Pipeline()
pcoll = pipeline | 'start' >> beam.Create([
    '14172425165068797305,3,0,3,0.07,0.36,1,4,4,3705.00765154,0.235002550513',
    '2746375035268210383,3,0,3,0.07,0.36,2,5,5,3789.1391067,0.263046368899',
    '16101396351712676789,3,0,3,0.07,0.37,1,4,3,3639.26112282,0.213087040939'])
multi = pcoll | "Multimap" >> beam.ParDo(MultiMap()).with_output_types(beam.typehints.Tuple[long, float])

With the DirectRunner, I get the following exception.

  File "apache_beam/runners/worker/operations.py", line 227, in apache_beam.runners.worker.operations.ReadOperation.start
  File "apache_beam/runners/worker/operations.py", line 228, in apache_beam.runners.worker.operations.ReadOperation.start
  File "apache_beam/runners/worker/operations.py", line 229, in apache_beam.runners.worker.operations.ReadOperation.start
  File "apache_beam/runners/worker/operations.py", line 238, in apache_beam.runners.worker.operations.ReadOperation.start
  File "apache_beam/runners/worker/operations.py", line 159, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 85, in apache_beam.runners.worker.operations.ConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 392, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 393, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 488, in apache_beam.runners.common.DoFnRunner.receive
  File "apache_beam/runners/common.py", line 496, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 537, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "/usr/local/lib/python2.7/dist-packages/six.py", line 737, in raise_from
    raise value
TypeError: 'long' object is not subscriptable [while running 'Multimap']

I need to figure out why this fails to pass the output of the ParDo on to GroupByKey.

1 Answer:

Answer 0: (score: 0)

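I found a solution, although it's not clear to me why the change is needed. All I did was change the `return` in the process method to `yield`, and it works. It looks as though the full PCollection was not being returned. Also, when the type hints are removed from the example, either yield or return works. With the type hints in place, however, only yield works.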
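A likely explanation, offered as a sketch rather than a confirmed diagnosis: Beam expects `process` to return an iterable of output elements. A bare tuple is itself an iterable, so `return r` emits the long and the float as two separate elements, while `yield r` (or `return [r]`) emits one tuple. The plain-Python sketch below does not use Beam at all. `collect_outputs` is a hypothetical stand-in for what a runner does with the return value, and `int` replaces Python 2's `long` so the code runs on Python 3.

```python
def process_return(element):
    """Mirrors the question's process(): returns one bare tuple."""
    items = element.split(',')
    return (int(items[0]), float(items[10]))


def process_yield(element):
    """Mirrors the fix: yields the tuple as a single output element."""
    items = element.split(',')
    yield (int(items[0]), float(items[10]))


def process_return_list(element):
    """Alternative fix: returns a list that contains the tuple."""
    items = element.split(',')
    return [(int(items[0]), float(items[10]))]


def collect_outputs(process_fn, element):
    """Hypothetical stand-in for a runner: iterate over whatever process returned."""
    return list(process_fn(element))


line = '14172425165068797305,3,0,3,0.07,0.36,1,4,4,3705.00765154,0.235002550513'

returned = collect_outputs(process_return, line)
yielded = collect_outputs(process_yield, line)
listed = collect_outputs(process_return_list, line)

print(len(returned))  # two outputs: the tuple was unpacked into its members
print(len(yielded))   # one output: the (key, value) tuple itself

# Treating an unpacked member as a (key, value) pair fails in the same way
# as the traceback in the question.
try:
    returned[0][0]
    subscript_error = None
except TypeError as exc:
    subscript_error = str(exc)
print(subscript_error)
```

If this reading is right, the version without type hints only appears to work: no error is raised, but the PCollection holds loose numbers rather than key-value tuples, which GroupByKey could not use either.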

This is very surprising behavior, and it is hard to debug. The Beam docs on ParDo seem to use return and yield interchangeably, without explaining when to use each.

Is this a bug, or just missing documentation?