Google Dataflow out of heap when creating multiple tagged outputs

Asked: 2016-05-15 14:38:38

Tags: google-cloud-dataflow

I have a number of large, unpartitioned BigQuery tables and files that I would like to partition in various ways, so I decided to try writing a Dataflow job to do it. The job itself seems simple enough. I tried to write it with generics so that it applies equally well to TextIO and BigQueryIO sources. It works fine on small tables, but when I run it against a large table I keep getting java.lang.OutOfMemoryError: Java heap space.

In my main class I either read a file with the target keys (made with another DF job) or run a query against a BigQuery table to get the list of keys to shard by. My main class looks like this:

Pipeline sharder = Pipeline.create(opts);

// a functional interface that shows the tag map how to get a tuple tag
KeySelector<String, TableRow> bqSelector = (TableRow row) -> (String) row.get("COLUMN") != null ? (String) row.get("COLUMN") : "null";

// a utility class to store a tuple tag list and hash map of String TupleTag
TupleTagMap<String, TableRow> bqTags = new TupleTagMap<>(new ArrayList<>(inputKeys),bqSelector);

// custom transform
ShardedTransform<String, TableRow> bqShard = new ShardedTransform<String, TableRow>(bqTags, TableRowJsonCoder.of());

String source = "PROJECTID:ADATASET.A_BIG_TABLE";
String destBase = "projectid:dataset.a_big_table_sharded_";

TableSchema schema = bq.tables().get("PROJECTID","ADATASET","A_BIG_TABLE").execute().getSchema();


PCollectionList<TableRow> shards = sharder.apply(BigQueryIO.Read.from(source)).apply(bqShard);
for (PCollection<TableRow> shard : shards.getAll()) {
    String shardName = StringUtils.isNotEmpty(shard.getName()) ? shard.getName() : "NULL";
    shard.apply(BigQueryIO.Write.to(destBase + shardName)
            .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withSchema(schema));
    System.out.println(destBase+shardName);
} 
sharder.run();
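
The opts, inputKeys, and bq objects referenced above aren't shown in the post. As a rough sketch of what they might look like, assuming Dataflow 1.x pipeline options and the plain google-api-services-bigquery client (the key-listing query and all names here are assumptions, not part of the original code):

// Pipeline options parsed from the command line.
DataflowPipelineOptions opts = PipelineOptionsFactory.fromArgs(args)
        .withValidation().as(DataflowPipelineOptions.class);

// Plain BigQuery API client, used only to fetch the schema and the list of shard keys.
Bigquery bq = new Bigquery.Builder(
        GoogleNetHttpTransport.newTrustedTransport(),
        JacksonFactory.getDefaultInstance(),
        GoogleCredential.getApplicationDefault().createScoped(BigqueryScopes.all()))
        .setApplicationName("sharder")
        .build();

// Hypothetical key-listing query: one row per distinct value of the shard column.
// (Ignores query paging and asynchronous completion for brevity.)
QueryResponse resp = bq.jobs()
        .query("PROJECTID", new QueryRequest()
                .setQuery("SELECT COLUMN FROM [PROJECTID:ADATASET.A_BIG_TABLE] GROUP BY COLUMN"))
        .execute();

Set<String> inputKeys = new HashSet<>();
for (TableRow r : resp.getRows()) {
    Object v = r.getF().get(0).getV();
    inputKeys.add(v != null ? v.toString() : "null");
}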

I generate a set of TupleTags to use in the custom transform. I created a utility class that stores a TupleTagList and a HashMap so that I can reference the tuple tags by key:

public class TupleTagMap<Key, Type> implements Serializable {

private static final long serialVersionUID = -8762959703864266959L;
final private TupleTagList tList;
final private Map<Key, TupleTag<Type>> map;
final private KeySelector<Key, Type> selector;

public TupleTagMap(List<Key> t, KeySelector<Key, Type> selector) {
    map = new HashMap<>();
    for (Key key : t)
        map.put(key, new TupleTag<Type>());
    this.tList = TupleTagList.of(new ArrayList<>(map.values()));
    this.selector = selector;

}

public Map<Key, TupleTag<Type>> getMap() {
    return map;
}

public TupleTagList getTagList() {
    return tList;
}

public TupleTag<Type> getTag(Type t) {
    return map.get(selector.getKey(t));
}
}

Then I have this custom transform, which basically has a function that uses the tuple map to output a PCollectionTuple and then moves it into a PCollectionList to return to the main class:

public class ShardedTransform<Key, Type> extends
    PTransform<PCollection<Type>, PCollectionList<Type>> {


private static final long serialVersionUID = 3320626732803297323L;
private final TupleTagMap<Key, Type> tags;
private final Coder<Type> coder;


public ShardedTransform(TupleTagMap<Key, Type> tags, Coder<Type> coder) {
    this.tags = tags;
    this.coder = coder;
}

@Override
public PCollectionList<Type> apply(PCollection<Type> in) {

    PCollectionTuple shards = in.apply(ParDo.of(
            new ShardFn<Key, Type>(tags)).withOutputTags(
            new TupleTag<Type>(), tags.getTagList()));

    List<PCollection<Type>> shardList = new ArrayList<>(tags.getMap().size());

    for (Entry<Key, TupleTag<Type>> e : tags.getMap().entrySet()){
        PCollection<Type> shard = shards.get(e.getValue()).setName(e.getKey().toString()).setCoder(coder);
        shardList.add(shard);
    }
    return PCollectionList.of(shardList);
    } 
}

The actual DoFn is dead simple: it just uses the lambda provided in the main class to find the matching tuple tag in the hash map and side-outputs the element to it:

public class ShardFn<Key, Type> extends DoFn<Type, Type> {

private static final long serialVersionUID = 961325260858465105L;

private final TupleTagMap<Key, Type> tags;

ShardFn(TupleTagMap<Key, Type> tags) {

    this.tags = tags;
}

@Override
public void processElement(DoFn<Type, Type>.ProcessContext c)
        throws Exception {
    Type element = c.element();
    TupleTag<Type> tag = tags.getTag(element);

    if (tag != null)
        c.sideOutput(tag, element);
    } 
}

1 answer:

Answer 0 (score: 1):

The Beam model doesn't currently have good support for dynamic partitioning / a large number of partitions. Your approach chooses the number of shards at graph construction time, and the resulting ParDos likely all get fused together, so you end up with each worker trying to write to 80 different BQ tables at the same time. Each write requires some local buffering, so it's probably just too much.

There's an alternate approach that parallelizes across tables (but not across elements). It works well if you have a large number of relatively small output tables. Use a ParDo to tag each element with the table it should go to, and then do a GroupByKey. This gives you a PCollection<KV<Table, Iterable<ElementsForThatTable>>>, and you can then process each KV<Table, Iterable<ElementsForThatTable>> by writing its elements to that table.
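
As a rough sketch of that approach, reusing the Dataflow 1.x SDK classes already in the question (written as if inside a static main method so the anonymous DoFns serialize cleanly; the table-naming logic and the manual write step are assumptions, and the actual BigQuery insert code is omitted):

PCollection<TableRow> rows = sharder.apply(BigQueryIO.Read.from(source));

// Key each row by the destination table it should land in.
PCollection<KV<String, TableRow>> keyed = rows
        .apply(ParDo.of(new DoFn<TableRow, KV<String, TableRow>>() {
            @Override
            public void processElement(ProcessContext c) {
                TableRow row = c.element();
                Object key = row.get("COLUMN");
                String table = destBase + (key != null ? key : "NULL");
                c.output(KV.of(table, row));
            }
        }))
        .setCoder(KvCoder.of(StringUtf8Coder.of(), TableRowJsonCoder.of()));

// Group by destination table, then write each group by hand with the BigQuery client.
keyed.apply(GroupByKey.<String, TableRow>create())
        .apply(ParDo.of(new DoFn<KV<String, Iterable<TableRow>>, Void>() {
            @Override
            public void processElement(ProcessContext c) throws Exception {
                String table = c.element().getKey();
                Iterable<TableRow> tableRows = c.element().getValue();
                // Create `table` if needed and insert tableRows into it with the BigQuery
                // API client (e.g. tabledata().insertAll in batches) -- this is the part
                // that currently has to be written manually, as noted below.
            }
        }));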

Unfortunately, for now you'll have to write the BQ writes by hand to use this option. We're looking at extending the Sink APIs with built-in support for this. Since the Dataflow SDK is being further developed as part of Apache Beam, we're tracking that request here: https://issues.apache.org/jira/browse/BEAM-92