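I'm having a problem using GroupByKey that I believe traces back to a type issue. I've been looking at it for a while and have followed several stack traces, but it's not clear to me why the following fails: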
import apache_beam as beam

@beam.typehints.with_output_types(beam.typehints.Tuple[long, float])
class MultiMap(beam.DoFn):
    def process(self, element):
        # Split the CSV line and keep the first and eleventh fields as (key, value).
        items = element.split(',')
        print items
        r = (long(items[0]), float(items[10]))
        print r
        return r

pipeline = beam.Pipeline()
pcoll = pipeline | 'start' >> beam.Create([
    '14172425165068797305,3,0,3,0.07,0.36,1,4,4,3705.00765154,0.235002550513',
    '2746375035268210383,3,0,3,0.07,0.36,2,5,5,3789.1391067,0.263046368899',
    '16101396351712676789,3,0,3,0.07,0.37,1,4,3,3639.26112282,0.213087040939'])
multi = pcoll | "Multimap" >> beam.ParDo(MultiMap()).with_output_types(beam.typehints.Tuple[long, float])
With the DirectRunner, I get the following exception.
  File "apache_beam/runners/worker/operations.py", line 227, in apache_beam.runners.worker.operations.ReadOperation.start
  File "apache_beam/runners/worker/operations.py", line 228, in apache_beam.runners.worker.operations.ReadOperation.start
  File "apache_beam/runners/worker/operations.py", line 229, in apache_beam.runners.worker.operations.ReadOperation.start
  File "apache_beam/runners/worker/operations.py", line 238, in apache_beam.runners.worker.operations.ReadOperation.start
  File "apache_beam/runners/worker/operations.py", line 159, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 85, in apache_beam.runners.worker.operations.ConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 392, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 393, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 488, in apache_beam.runners.common.DoFnRunner.receive
  File "apache_beam/runners/common.py", line 496, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 537, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "/usr/local/lib/python2.7/dist-packages/six.py", line 737, in raise_from
    raise value
TypeError: 'long' object is not subscriptable [while running 'Multimap']
I need to figure out why this fails to pass the output of the ParDo on to GroupByKey.
Answer 0 (score: 0)
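I found a solution, although it's not clear to me why the change is needed. All I did was change the "return" in the process method to "yield", and it works. It seems the complete PCollection was not being returned. Also, with the type hints removed from the example, either yield or return works. With the type hints in place, however, only yield works.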
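A possible explanation, offered as a sketch rather than a confirmed diagnosis: Beam treats the value returned from `process` as an iterable of output elements. Returning a bare tuple would therefore emit the key and the value as two separate elements, while `yield` (or returning a one-item list) emits the tuple as a single element. That would be consistent with the `'long' object is not subscriptable` error, since the type check would presumably be applied to a bare `long` instead of a tuple. The snippet below imitates this in plain Python without Beam; `collect_outputs` is a hypothetical stand-in for the runner, not Beam's actual code, and `int` is used in place of `long` so it also runs on Python 3.

```python
def process_with_return(element):
    items = element.split(',')
    # Returns a bare tuple: iterating over it gives the key, then the value.
    return (int(items[0]), float(items[10]))


def process_with_yield(element):
    items = element.split(',')
    # Generator: iterating over it gives exactly one item, the whole tuple.
    yield (int(items[0]), float(items[10]))


def process_with_list(element):
    items = element.split(',')
    # Returning a one-item list behaves the same way as yield.
    return [(int(items[0]), float(items[10]))]


def collect_outputs(process_fn, element):
    # Hypothetical stand-in for the runner (an assumption, not Beam's code):
    # whatever process returns is iterated, and each item is one output element.
    return list(process_fn(element))


line = '14172425165068797305,3,0,3,0.07,0.36,1,4,4,3705.00765154,0.235002550513'

returned = collect_outputs(process_with_return, line)  # two separate elements
yielded = collect_outputs(process_with_yield, line)    # one (key, value) element
listed = collect_outputs(process_with_list, line)      # one (key, value) element
```

If this is indeed the cause, keeping `return` but wrapping the result in a list (`return [r]`) should work as well as switching to `yield`.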
This is very surprising behavior, and it is hard to debug. The Beam docs on ParDo seem to use return and yield interchangeably, without explaining when to use each.
Is this a bug, or is it just missing documentation?