Can I pass side inputs to Apache Beam PTransforms?

Time: 2018-03-07 15:55:58

Tags: tensorflow google-cloud-dataflow apache-beam

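I am using Apache Beam to preprocess data for TensorFlow. I want to choose the number of TFRecord shards based on the number of examples in the dataset. The relevant part of the code is: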

EXAMPLES_PER_SHARD = 5.0
num_tfexamples = tfexample_strs | "count tf examples" >> beam.combiners.Count.Globally()
num_shards = num_tfexamples | ("compute number of shards" >>
                               beam.Map(lambda num_examples: int(math.ceil(num_examples / EXAMPLES_PER_SHARD))))
_ = tfexample_strs | ("output to tfrecords" >>
                      beam.io.WriteToTFRecord(OUTPUT_DIR, num_shards=beam.pvalue.AsSingleton(num_shards)))

This fails with the following stack trace:

File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/iobase.py", line 1011, in start_bundle
    self.counter = random.randint(0, self.count - 1)
TypeError: unsupported operand type(s) for -: 'AsSingleton' and 'int' [while running 'output VALIDATION to tfrecords/Write/WriteImpl/ParDo(_RoundRobinKeyFn)']
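
The trace suggests that the AsSingleton wrapper itself, rather than the integer it would eventually resolve to, reaches the arithmetic `self.count - 1`. A minimal sketch of that failure mode without Beam, assuming a hypothetical stand-in class `FakeSideInput` (not a Beam class) and a hypothetical helper `pick_start` that mirrors the expression in the trace:

```python
import random


class FakeSideInput(object):
    """Hypothetical stand-in for an unresolved side-input wrapper (not part of Beam)."""

    def __init__(self, pcoll):
        self.pcoll = pcoll


def pick_start(count):
    # Same shape as the expression in the trace: count must be a plain int.
    return random.randint(0, count - 1)


try:
    pick_start(FakeSideInput("num_shards"))
    error = None
except TypeError as exc:
    # The wrapper does not define subtraction, so Python raises TypeError.
    error = exc
```

With a plain integer, for example `pick_start(4)`, the same call succeeds, which is consistent with num_shards being expected as an int when the pipeline is constructed.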

I saw these lines in the class definition of PTransform:

# By default, transforms don't have any side inputs.
side_inputs = ()

Is it possible to pass side inputs to PTransforms? Thanks for your help.

1 Answer:

Answer 0: (score: 1)

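WriteToTFRecord does not support a side input for num_shards. In theory nothing prevents it from doing so (and it may be possible in the Java SDK); it just has not been implemented in the Python SDK. Feel free to file a JIRA.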
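
For contrast, transforms such as beam.Map and beam.ParDo in the Python SDK do accept side inputs as extra arguments. The following is a minimal sketch, assuming Python 3 and an installed apache_beam package. The helper names `compute_num_shards`, `tag_with_shard_count`, and `run` are hypothetical, and the Beam import is placed inside `run` only so that the pure-Python helpers can be used on their own. It does not make WriteToTFRecord accept a side input; it only shows where side inputs are accepted.

```python
import math

EXAMPLES_PER_SHARD = 5.0


def compute_num_shards(num_examples, examples_per_shard=EXAMPLES_PER_SHARD):
    # Round up so that a partially filled last shard is still counted.
    return int(math.ceil(num_examples / examples_per_shard))


def tag_with_shard_count(example, num_shards):
    # num_shards arrives as a plain int once Beam has resolved the side input.
    return (example, num_shards)


def run(examples):
    # Imported here (an illustrative choice) so the helpers above work without Beam.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        tfexample_strs = pipeline | "create" >> beam.Create(examples)
        num_shards = (
            tfexample_strs
            | "count tf examples" >> beam.combiners.Count.Globally()
            | "compute number of shards" >> beam.Map(compute_num_shards)
        )
        _ = (
            tfexample_strs
            # beam.Map forwards the resolved singleton to the keyword argument.
            | "attach shard count" >> beam.Map(
                tag_with_shard_count,
                num_shards=beam.pvalue.AsSingleton(num_shards))
            | "print" >> beam.Map(print)
        )
```

Calling `run` with a list of serialized examples is the assumed entry point. Because num_shards in WriteToTFRecord is read as an ordinary integer, the shard count would still have to be known before that transform is constructed.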