Question

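I have a pipeline that uses the Python SDK 2.2.0 for Apache Beam.

This pipeline is almost a typical word count: I have name pairs in the format ("John Doe, Jane Smith", 1), and I'm trying to figure out how many times each pair of names appears together, like this: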

p_collection
            | "PairWithOne" >> beam.Map(lambda pair: (', '.join(pair).encode("ascii", errors="ignore").decode(), 1))
            | "GroupByKey" >> beam.GroupByKey()
            | "AggregateGroups" >> beam.Map(lambda (pair, ones): (pair, sum(ones)))
            | "Format" >> beam.Map(lambda element: {'pair': element[0], 'pair_count': element[1]})
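
For reference, the counting this pipeline is meant to perform can be sketched in plain Python, without Beam. This is only an illustration under assumptions: the sample pairs and the helper name count_pairs are made up here, and each input element is assumed to be a sequence of two name strings.

```python
from collections import Counter

def count_pairs(pairs):
    """Count how often each name pair occurs (hypothetical helper, not Beam code)."""
    # Build the same "John Doe, Jane Smith" style key the pipeline uses.
    keys = (', '.join(pair) for pair in pairs)
    counts = Counter(keys)
    # Mirror the output shape of the "Format" step, sorted by key for stable output.
    return [{'pair': key, 'pair_count': n} for key, n in sorted(counts.items())]

# Made-up sample data.
sample = [
    ("John Doe", "Jane Smith"),
    ("John Doe", "Jane Smith"),
    ("Ann Lee", "Bob Ray"),
]
result = count_pairs(sample)
```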

When I run this code locally with a small dataset, it works fine.

But when I deploy it to Google Cloud DataFlow, I get the following error:

An exception was raised when trying to execute the workitem 423109085466017585 : Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 167, in execute
    op.start()
  File "dataflow_worker/shuffle_operations.py", line 49, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    def start(self):
  File "dataflow_worker/shuffle_operations.py", line 50, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    with self.scoped_start_state:
  File "dataflow_worker/shuffle_operations.py", line 65, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    with self.shuffle_source.reader() as reader:
  File "dataflow_worker/shuffle_operations.py", line 69, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    self.output(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 154, in apache_beam.runners.worker.operations.Operation.output
    cython.cast(Receiver, self.receivers[output_index]).receive(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 86, in apache_beam.runners.worker.operations.ConsumerSet.receive
    cython.cast(Operation, consumer).process(windowed_value)
  File "dataflow_worker/shuffle_operations.py", line 233, in dataflow_worker.shuffle_operations.BatchGroupAlsoByWindowsOperation.process
    self.output(wvalue.with_value((k, wvalue.value)))
  File "apache_beam/runners/worker/operations.py", line 154, in apache_beam.runners.worker.operations.Operation.output
    cython.cast(Receiver, self.receivers[output_index]).receive(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 86, in apache_beam.runners.worker.operations.ConsumerSet.receive
    cython.cast(Operation, consumer).process(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 339, in apache_beam.runners.worker.operations.DoOperation.process
    with self.scoped_process_state:
  File "apache_beam/runners/worker/operations.py", line 340, in apache_beam.runners.worker.operations.DoOperation.process
    self.dofn_receiver.receive(o)
  File "apache_beam/runners/common.py", line 382, in apache_beam.runners.common.DoFnRunner.receive
    self.process(windowed_value)
  File "apache_beam/runners/common.py", line 390, in apache_beam.runners.common.DoFnRunner.process
    self._reraise_augmented(exn)
  File "apache_beam/runners/common.py", line 415, in apache_beam.runners.common.DoFnRunner._reraise_augmented
    raise
  File "apache_beam/runners/common.py", line 388, in apache_beam.runners.common.DoFnRunner.process
    self.do_fn_invoker.invoke_process(windowed_value)
  File "apache_beam/runners/common.py", line 189, in apache_beam.runners.common.SimpleInvoker.invoke_process
    self.output_processor.process_outputs(
  File "apache_beam/runners/common.py", line 480, in apache_beam.runners.common._OutputProcessor.process_outputs
    self.main_receivers.receive(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 86, in apache_beam.runners.worker.operations.ConsumerSet.receive
    cython.cast(Operation, consumer).process(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 339, in apache_beam.runners.worker.operations.DoOperation.process
    with self.scoped_process_state:
  File "apache_beam/runners/worker/operations.py", line 340, in apache_beam.runners.worker.operations.DoOperation.process
    self.dofn_receiver.receive(o)
  File "apache_beam/runners/common.py", line 382, in apache_beam.runners.common.DoFnRunner.receive
    self.process(windowed_value)
  File "apache_beam/runners/common.py", line 390, in apache_beam.runners.common.DoFnRunner.process
    self._reraise_augmented(exn)
  File "apache_beam/runners/common.py", line 431, in apache_beam.runners.common.DoFnRunner._reraise_augmented
    raise new_exn, None, original_traceback
  File "apache_beam/runners/common.py", line 388, in apache_beam.runners.common.DoFnRunner.process
    self.do_fn_invoker.invoke_process(windowed_value)
  File "apache_beam/runners/common.py", line 189, in apache_beam.runners.common.SimpleInvoker.invoke_process
    self.output_processor.process_outputs(
  File "apache_beam/runners/common.py", line 480, in apache_beam.runners.common._OutputProcessor.process_outputs
    self.main_receivers.receive(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 84, in apache_beam.runners.worker.operations.ConsumerSet.receive
    self.update_counters_start(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 90, in apache_beam.runners.worker.operations.ConsumerSet.update_counters_start
    self.opcounter.update_from(windowed_value)
  File "apache_beam/runners/worker/opcounters.py", line 63, in apache_beam.runners.worker.opcounters.OperationCounters.update_from
    self.do_sample(windowed_value)
  File "apache_beam/runners/worker/opcounters.py", line 81, in apache_beam.runners.worker.opcounters.OperationCounters.do_sample
    self.coder_impl.get_estimated_size_and_observables(windowed_value))
  File "apache_beam/coders/coder_impl.py", line 730, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
    def get_estimated_size_and_observables(self, value, nested=False):
  File "apache_beam/coders/coder_impl.py", line 739, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
    self._value_coder.get_estimated_size_and_observables(
  File "apache_beam/coders/coder_impl.py", line 518, in apache_beam.coders.coder_impl.AbstractComponentCoderImpl.get_estimated_size_and_observables
    values[i], nested=nested or i + 1 < len(self._coder_impls)))
RuntimeError: KeyError: 0 [while running 'Transform/Format']

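Looking at the source code where this error pops up, I thought it might be because some of the names contain odd encoded characters, so in an act of desperation I tried the .encode("ascii", errors="ignore").decode() you see in the code, but no luck.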
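
As a small illustration of what that encode/decode round trip does (a sketch assuming Python 3 string semantics, with made-up sample names and a hypothetical helper to_ascii): characters outside ASCII are silently dropped rather than transliterated, so the key text changes but remains a plain string.

```python
def to_ascii(text):
    """Drop every non-ASCII character from text (same call chain as in the pipeline)."""
    return text.encode("ascii", errors="ignore").decode()

# Made-up names; the accented letters are written as escapes to keep the source ASCII-only.
accented = u"Jos\u00e9 N\u00fa\u00f1ez, Zo\u00eb M\u00fcller"
cleaned = to_ascii(accented)                    # accented letters are removed
unchanged = to_ascii(u"John Doe, Jane Smith")   # pure ASCII input passes through
```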

Any ideas as to why this pipeline executes successfully locally but fails on the DataFlow runner?

Thanks!

Answer 1

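This isn't really a solution to my problem, since it avoids the problem in the first place, but it did get my code running, thanks to a suggestion from user1093967 in the comments.

I just replaced the GroupByKey and AggregateGroups steps with a single CombinePerKey(sum) step, and the problem no longer occurs.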

p_collection
        | "PairWithOne" >> beam.Map(lambda pair: (', '.join(pair).encode("ascii", errors="ignore").decode(), 1))
        | "GroupAndSum" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda element: {'pair': element[0], 'pair_count': element[1]})
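
One way to picture the difference between the two versions, as a plain-Python model rather than Beam's actual execution: grouping first collects every value for a key and then sums them, while a per-key combine keeps only a running total. The function names below are hypothetical.

```python
from collections import defaultdict

def group_then_sum(pairs):
    """Model of GroupByKey followed by a sum over each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)   # every value is kept until the group is complete
    return {key: sum(values) for key, values in groups.items()}

def combine_per_key(pairs):
    """Model of CombinePerKey(sum): one running total per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value        # only the accumulator is kept
    return dict(totals)

# Made-up keyed input in the same (key, 1) shape the pipeline produces.
pairs = [("a, b", 1), ("c, d", 1), ("a, b", 1)]
```

Both functions give the same counts for this input; they differ only in how much intermediate data they hold per key.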

Still, I'd be glad to hear why it happens.

Answer 2

In some cases, like my own, you need the intermediate grouped values, so CombinePerKey isn't ideal. In this more general case, you can replace GroupByKey() with CombineValues(ToListCombineFn()).
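
To make the idea of a list-building combiner concrete, here is a rough stand-alone sketch. The method names follow Beam's CombineFn interface (create_accumulator, add_input, merge_accumulators, extract_output), but the class ToListSketch and the driver code are hypothetical and do not use Beam; they only mimic how partial lists built separately could be merged into one.

```python
class ToListSketch(object):
    """Stand-alone imitation of a to-list combiner (not Beam's ToListCombineFn)."""

    def create_accumulator(self):
        # Each partial result starts as an empty list.
        return []

    def add_input(self, accumulator, element):
        accumulator.append(element)
        return accumulator

    def merge_accumulators(self, accumulators):
        # Concatenate the partial lists into a single list.
        merged = []
        for acc in accumulators:
            merged.extend(acc)
        return merged

    def extract_output(self, accumulator):
        return accumulator

fn = ToListSketch()
# Pretend two workers each saw part of the values for one key.
part_a = fn.add_input(fn.add_input(fn.create_accumulator(), 1), 2)
part_b = fn.add_input(fn.create_accumulator(), 3)
values = fn.extract_output(fn.merge_accumulators([part_a, part_b]))
```

The result is an ordinary list, so downstream code can index it or take its length.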

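I'm not sure why this works while GroupByKey doesn't. My guess is that, in a parallel execution environment, consuming the _UnwindowedValues iterable returned by GroupByKey as if it were a list fails. I was doing something like this: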

... | beam.GroupByKey()
    | beam.Map(lambda k_v: (k_v[0], foo(list(k_v[1]))))
    | ...

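where foo requires a complete, indexable list and isn't easily composable. I'm not sure why this limitation would cause a problem for you, though; sum can operate on an iterable.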
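
The distinction drawn here can be seen with ordinary Python iterables. In this sketch a generator stands in for the grouped values (an assumption; it is not the _UnwindowedValues class), and foo is replaced by a made-up function middle_value that needs len() and indexing.

```python
def grouped_values():
    """Stand-in for the iterable of values that arrives with a key."""
    for value in (3, 1, 2):
        yield value

def middle_value(values):
    """Hypothetical foo: uses len() and indexing, so it needs a real list."""
    return sorted(values)[len(values) // 2]

# sum only iterates, so it works directly on the iterable.
total = sum(grouped_values())

# Calling len() on the bare iterable fails...
try:
    middle_value(grouped_values())
    needs_list = False
except TypeError:
    needs_list = True

# ...but the same call works once the values are materialized with list().
middle = middle_value(list(grouped_values()))
```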

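This solution isn't ideal, since (I believe) you lose some parallelization with the ToList conversion. That said, if anyone else faces the same problem, at least it's an option!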

Answer 3

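GroupByKey groups all elements with the same key and emits one (key, values) element per key. The next stage receives an Iterable that collects all the elements with the same key. The important point is that this Iterable is evaluated lazily, at least when GroupByKey is executed on the Dataflow runner. This means that elements are loaded into memory on demand, as they are requested from the iterator.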
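
Lazy evaluation of this kind can be imitated with a generator. The sketch below is only an analogy in plain Python, not Dataflow's implementation: the loaded list is a made-up device for recording when each element is actually produced.

```python
loaded = []

def lazy_values(source):
    """Yield values one at a time, recording each one as it is loaded."""
    for value in source:
        loaded.append(value)
        yield value

values = lazy_values([10, 20, 30])
before = list(loaded)        # snapshot: creating the iterable has loaded nothing

first = next(values)         # requesting one element loads only that element
after_first = list(loaded)

rest = list(values)          # consuming the rest loads the remaining elements
```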

CombinePerKey, on the other hand, also groups all elements with the same key, but performs an aggregation before emitting a single value.

pcollection_obj
        | "MapWithOne" >> beam.Map(lambda pair: (', '.join(pair).encode("ascii", errors="ignore").decode(), 1))
        | "GroupByKeyAndSum" >> beam.CombinePerKey(sum)
        | "CreateDictionary" >> beam.Map(lambda element: {'pair': element[0], 'pair_count': element[1]})

Apache Beam GroupByKey() fails when running on Google DataFlow in Python
