Question

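I have a use case where I need to read a BigQuery table into a Dataflow pipeline, then read every row of that PCollection to build a graph data structure. The graph is then passed as a SideInput to further transform steps, which take the graph and a PCollection from another BigQuery table as arguments. Here is what I have so far: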

Pipeline pipeline = Pipeline.create(options);

//Read from big query
PCollection<TableRow> bqTable = pipeline.apply("ReadFooBQTable", BigQueryIO.Read.from("Table"));

//Loop over PCollection create "graph" still need to figure this out


//pass the graph as side input 
pCol.apply("Process", ParDo.withSideInputs(graph).of(new BlueKai.ProcessBatch(graph))).apply("Write",
    Write.to(new DecoratedFileSink<String>(standardBucket, "csv", TextIO.DEFAULT_TEXT_CODER, null, null, WriterOutputGzipDecoratorFactory.getInstance())).withNumShards(numChunks));

Answer 1

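The issue is how to serialize the graph so that it can be passed between machines. If you define a Coder that describes how to serialize the elements representing the graph, you can use it as a side input in the way you describe.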
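To make the serialization point concrete, here is a minimal sketch in plain Java of the encode/decode round trip that a custom Coder would need to provide. It assumes the graph is an adjacency map from vertex name to a set of neighbor names. That representation and the class name GraphCodecSketch are hypothetical, and wiring the two methods into the SDK's Coder class is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: not part of the Dataflow SDK.
final class GraphCodecSketch {

    // Writes the vertex count, then each vertex followed by its neighbor list.
    static byte[] encode(Map<String, Set<String>> adjacency) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(adjacency.size());
            for (Map.Entry<String, Set<String>> entry : adjacency.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue().size());
                for (String neighbor : entry.getValue()) {
                    out.writeUTF(neighbor);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the same layout back into a sorted adjacency map.
    static Map<String, Set<String>> decode(byte[] data) {
        Map<String, Set<String>> adjacency = new TreeMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int vertexCount = in.readInt();
            for (int i = 0; i < vertexCount; i++) {
                String vertex = in.readUTF();
                int degree = in.readInt();
                Set<String> neighbors = new TreeSet<>();
                for (int j = 0; j < degree; j++) {
                    neighbors.add(in.readUTF());
                }
                adjacency.put(vertex, neighbors);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return adjacency;
    }
}
```

Whatever representation you pick, the property that matters is that decode(encode(graph)) gives back an equal graph, since every worker receives the side input in its encoded form.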

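Assuming the graph can be encoded, you can simply use it as a singleton side input. This assumes that the number of rows is small enough to be processed on a single machine. You will probably want to define a CombineFn<TableRow, Graph, Graph> that computes the graph from the table rows. Assuming two graphs can be combined (that is, merging them is an associative and commutative operation), you can use Combine together with asSingletonView.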
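As a rough sketch of what such a combiner has to do, the plain Java below mirrors the four steps of a CombineFn (createAccumulator, addInput, mergeAccumulators, extractOutput) without depending on the SDK. Several things here are assumptions for illustration only: the Graph class, the column names "src" and "dst", and the use of a plain Map<String, Object> as a stand-in for TableRow.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical adjacency-set graph; not part of any SDK.
final class Graph {
    final Map<String, Set<String>> edges = new HashMap<>();

    void addEdge(String from, String to) {
        edges.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    // Set union of edges: associative and commutative, so partial
    // graphs can be merged in any order and any grouping.
    void mergeFrom(Graph other) {
        other.edges.forEach((from, targets) ->
            edges.computeIfAbsent(from, k -> new HashSet<>()).addAll(targets));
    }
}

// Mirrors the shape of a CombineFn<TableRow, Graph, Graph> in plain Java.
final class GraphCombineSketch {

    static Graph createAccumulator() {
        return new Graph();
    }

    // "src" and "dst" are assumed column names, one edge per row.
    static Graph addInput(Graph accumulator, Map<String, Object> row) {
        accumulator.addEdge((String) row.get("src"), (String) row.get("dst"));
        return accumulator;
    }

    static Graph mergeAccumulators(Iterable<Graph> accumulators) {
        Graph merged = createAccumulator();
        for (Graph partial : accumulators) {
            merged.mergeFrom(partial);
        }
        return merged;
    }

    static Graph extractOutput(Graph accumulator) {
        return accumulator;
    }
}
```

Because merging is a set union, it does not matter how the runner splits the rows across workers or in which order it merges the partial graphs. If building your graph depends on row order, this approach does not apply as written.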

Another approach is to use a List<TableRow> as the side input and have each machine build the graph itself.
