I have a number of large, unpartitioned BigQuery tables and files that I'd like to shard in various ways, so I decided to try writing a Dataflow job to do it. I figured the job itself would be simple. I tried to write it with generics so that it would work with both TextIO and BigQueryIO sources. It works fine on small tables, but when I run it against a large table I keep getting java.lang.OutOfMemoryError: Java heap space.

In my main class, I either read a file of target keys (made with another DF job) or run a query against a BigQuery table to get the list of keys to shard by. My main class looks like this:
Pipeline sharder = Pipeline.create(opts);

// a functional interface that shows the tag map how to get a tuple tag
KeySelector<String, TableRow> bqSelector = (TableRow row) -> (String) row.get("COLUMN") != null ? (String) row.get("COLUMN") : "null";

// a utility class to store a tuple tag list and hash map of String TupleTag
TupleTagMap<String, TableRow> bqTags = new TupleTagMap<>(new ArrayList<>(inputKeys), bqSelector);

// custom transform
ShardedTransform<String, TableRow> bqShard = new ShardedTransform<String, TableRow>(bqTags, TableRowJsonCoder.of());

String source = "PROJECTID:ADATASET.A_BIG_TABLE";
String destBase = "projectid:dataset.a_big_table_sharded_";

TableSchema schema = bq.tables().get("PROJECTID", "ADATASET", "A_BIG_TABLE").execute().getSchema();

PCollectionList<TableRow> shards = sharder.apply(BigQueryIO.Read.from(source)).apply(bqShard);

for (PCollection<TableRow> shard : shards.getAll()) {
    String shardName = StringUtils.isNotEmpty(shard.getName()) ? shard.getName() : "NULL";
    shard.apply(BigQueryIO.Write.to(destBase + shardName)
        .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withSchema(schema));
    System.out.println(destBase + shardName);
}

sharder.run();
I generate a set of TupleTags to use in the custom transform. I created a utility class that stores a TupleTagList and a HashMap so that I can reference the tuple tags by key:
public class TupleTagMap<Key, Type> implements Serializable {

    private static final long serialVersionUID = -8762959703864266959L;
    final private TupleTagList tList;
    final private Map<Key, TupleTag<Type>> map;
    final private KeySelector<Key, Type> selector;

    public TupleTagMap(List<Key> t, KeySelector<Key, Type> selector) {
        map = new HashMap<>();
        for (Key key : t)
            map.put(key, new TupleTag<Type>());
        this.tList = TupleTagList.of(new ArrayList<>(map.values()));
        this.selector = selector;
    }

    public Map<Key, TupleTag<Type>> getMap() {
        return map;
    }

    public TupleTagList getTagList() {
        return tList;
    }

    public TupleTag<Type> getTag(Type t) {
        return map.get(selector.getKey(t));
    }
}
Then I have this custom transform, which basically has a function that uses the tuple map to produce side outputs into a PCollectionTuple and then moves them into a PCollectionList to return to the main class:
public class ShardedTransform<Key, Type> extends
        PTransform<PCollection<Type>, PCollectionList<Type>> {

    private static final long serialVersionUID = 3320626732803297323L;
    private final TupleTagMap<Key, Type> tags;
    private final Coder<Type> coder;

    public ShardedTransform(TupleTagMap<Key, Type> tags, Coder<Type> coder) {
        this.tags = tags;
        this.coder = coder;
    }

    @Override
    public PCollectionList<Type> apply(PCollection<Type> in) {
        PCollectionTuple shards = in.apply(ParDo.of(
                new ShardFn<Key, Type>(tags)).withOutputTags(
                new TupleTag<Type>(), tags.getTagList()));
        List<PCollection<Type>> shardList = new ArrayList<>(tags.getMap().size());
        for (Entry<Key, TupleTag<Type>> e : tags.getMap().entrySet()) {
            PCollection<Type> shard = shards.get(e.getValue())
                    .setName(e.getKey().toString())
                    .setCoder(coder);
            shardList.add(shard);
        }
        return PCollectionList.of(shardList);
    }
}
The actual DoFn is trivial: it just uses the lambda supplied by the main class to find the matching tuple tag in the hash map for the side output:
public class ShardFn<Key, Type> extends DoFn<Type, Type> {

    private static final long serialVersionUID = 961325260858465105L;
    private final TupleTagMap<Key, Type> tags;

    ShardFn(TupleTagMap<Key, Type> tags) {
        this.tags = tags;
    }

    @Override
    public void processElement(DoFn<Type, Type>.ProcessContext c)
            throws Exception {
        Type element = c.element();
        TupleTag<Type> tag = tags.getTag(element);
        if (tag != null)
            c.sideOutput(tag, element);
    }
}
Answer (score: 1)
The Beam model doesn't currently have good support for dynamic partitioning / large numbers of partitions. Your approach chooses the number of shards at graph construction time, and then the resulting ParDos likely all get fused together, so you have each worker trying to write to 80 different BQ tables at the same time. Each write requires some local buffering, so it's probably just too much.
There's an alternative that will parallelize across tables (but not across elements). This works well if you have a large number of relatively small output tables. Use a ParDo to tag each element with the table it should go to, and then do a GroupByKey. That gives you a PCollection<KV<Table, Iterable<ElementsForThatTable>>>, and then you process each KV<Table, Iterable<ElementsForThatTable>> by writing the elements to the table.
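The grouping step this describes can be illustrated in plain Java (no Dataflow dependency), mimicking what the tagging ParDo followed by GroupByKey would produce. Here `tableFor` is a hypothetical stand-in for whatever key function routes a row to its destination table, and the Map entries play the role of the KV<Table, Iterable<...>> pairs:

```java
import java.util.*;
import java.util.stream.*;

public class GroupByTableSketch {

    // Hypothetical routing function: decides which destination table a row
    // belongs to (in the real pipeline this is the tagging ParDo's logic).
    static String tableFor(String row) {
        return row.startsWith("a") ? "dataset.table_a" : "dataset.table_b";
    }

    // Mimics ParDo(tag) + GroupByKey: one group per destination table, so
    // downstream work parallelizes across tables rather than across elements.
    static Map<String, List<String>> groupByTable(List<String> rows) {
        return rows.stream()
                .collect(Collectors.groupingBy(GroupByTableSketch::tableFor));
    }

    public static void main(String[] args) {
        Map<String, List<String>> groups =
                groupByTable(Arrays.asList("apple", "banana", "avocado"));
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            // In the real pipeline, each group would be written to its
            // BigQuery table here (by hand, per the note below).
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Because the grouping produces one iterable per table, each table's write happens in a single downstream call rather than 80 concurrent buffered writers per worker.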
Unfortunately, for now you'd have to do the BQ writes by hand to use this option. We're looking at extending the Sink APIs with built-in support for this. Since the Dataflow SDK is being further developed as part of Apache Beam, we're tracking that request here: https://issues.apache.org/jira/browse/BEAM-92