Bigtable bulk load with Dataflow is too slow

Date: 2015-11-25 16:31:53

Tags: google-cloud-dataflow google-cloud-bigtable

What is the best way to bulk load Bigtable for a pattern like a 20 GB data file every 3 hours? Is Dataflow the right tool for this?

Here is the problem we are hitting when using Dataflow to bulk load Bigtable:

It looks like the Dataflow QPS and the QPS of Bigtable (5 nodes) are mismatched. I am trying to load a 20 GB file into Bigtable with Dataflow, and it takes 4 hours to ingest. On top of that, I keep getting this warning throughout the run:

{
  "code" : 429,
  "errors" : [ {
    "domain" : "global",
    "message" : "Request throttled due to project QPS limit being reached.",
    "reason" : "rateLimitExceeded"
  } ],
  "message" : "Request throttled due to project QPS limit being reached.",
  "status" : "RESOURCE_EXHAUSTED"
}

Code:

// CloudBigtableOptions is one way to retrieve the options. It's not
// required.
CloudBigtableOptions options = PipelineOptionsFactory.fromArgs(btargs.toArray(new String[btargs.size()]))
    .withValidation().as(CloudBigtableOptions.class);

// CloudBigtableTableConfiguration contains the project, zone, cluster
// and table to connect to.
CloudBigtableTableConfiguration config = CloudBigtableTableConfiguration.fromCBTOptions(options);

Pipeline p = Pipeline.create(options);

// This sets up serialization for Puts and Deletes so that Dataflow can
// potentially move them through the network.
CloudBigtableIO.initializeForWrite(p);

p.apply(TextIO.Read.from(inpath)).apply(ParDo.of(new CreatePutsFn(columns, delim)))
    .apply(CloudBigtableIO.writeToTable(config));

p.run();
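
For illustration, the btargs list in the snippet above would carry flags along these lines. The flag names and values below are assumptions for the sake of the example (based on the getters that CloudBigtableOptions and the standard Dataflow pipeline options expose), not taken from the actual job:

import java.util.Arrays;
import java.util.List;

// Hypothetical arguments -- names and values are placeholders only.
List<String> btargs = Arrays.asList(
    "--project=my-gcp-project",
    "--runner=DataflowPipelineRunner",
    "--stagingLocation=gs://my-bucket/staging",
    "--bigtableProjectId=my-gcp-project",
    "--bigtableClusterId=my-cluster",
    "--bigtableZoneId=us-central1-b",
    "--bigtableTableId=my-table");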

CreatePutsFn:

@Override
public void processElement(DoFn<String, Mutation>.ProcessContext c) throws Exception {
    String[] vals = c.element().split(this.delim);
    for (int i = 0; i < columns.length; i++) {
        // Skip the row key column and empty values (use isEmpty(); != "" is a reference comparison in Java)
        if (i != keyPos && !vals[i].trim().isEmpty()) {
            c.output(new Put(vals[keyPos].getBytes()).addColumn(FAMILY, Bytes.toBytes(columns[i].toLowerCase()),
                    Bytes.toBytes(vals[i])));
        }
    }
}
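
For reference, here is a self-contained sketch of what the full DoFn might look like. Only processElement appears above, so the imports, field declarations, constructor, keyPos value, and column family name are assumptions, not the asker's actual code:

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical reconstruction: one Put is emitted per non-empty, non-key
// column of each input line (the accepted answer below collapses this to
// a single Put per row).
class CreatePutsFn extends DoFn<String, Mutation> {
    private static final byte[] FAMILY = Bytes.toBytes("cf"); // assumed family name
    private final String[] columns;  // column names, one per delimited field
    private final String delim;      // field delimiter, e.g. "," or "\t"
    private final int keyPos = 0;    // index of the row key field (assumed)

    CreatePutsFn(String[] columns, String delim) {
        this.columns = columns;
        this.delim = delim;
    }

    @Override
    public void processElement(ProcessContext c) throws Exception {
        String[] vals = c.element().split(this.delim);
        for (int i = 0; i < columns.length; i++) {
            if (i != keyPos && !vals[i].trim().isEmpty()) {
                c.output(new Put(vals[keyPos].getBytes()).addColumn(FAMILY,
                        Bytes.toBytes(columns[i].toLowerCase()), Bytes.toBytes(vals[i])));
            }
        }
    }
}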

Any help is much appreciated. Thanks.

1 Answer:

Answer 0 (score: 3):

I was able to resolve this. I did the following three things to get to the desired state, and the job now runs and ingests the data for one (20 GB) file in about 15 minutes; it previously took 4-5 hours.

  1. Combined all the columns of a row into a single Put. The job had been creating 2 billion put requests with Dataflow within 3 minutes; batching the columns reduced that to 40 million requests, i.e. one mutation per row instead of one per cell. The reworked processElement:
        public void processElement(DoFn<String, Mutation>.ProcessContext c) throws Exception {
            String[] vals = c.element().split(this.delim);
            // Build one Put per row and add every non-empty, non-key column to it.
            Put put = new Put(vals[keyPos].getBytes());
            for (int i = 0; i < columns.length; i++) {
                if (i != keyPos && !vals[i].trim().isEmpty()) {
                    put.addColumn(FAMILY, Bytes.toBytes(columns[i].toLowerCase()), Bytes.toBytes(vals[i]));
                }
            }
            c.output(put);
        }
    
  2. Added the client write buffer property: config.toHBaseConfig().set("hbase.client.write.buffer", "200971520"); (a sketch of where this line fits appears after this list).

  3. You were right about the project QPS limit being reached, so I temporarily bumped the cluster up to 10 nodes (from 3) for the duration of the bulk load.
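
For context, a minimal sketch of where the write buffer property from step 2 might be set relative to the pipeline setup shown in the question. The ordering is an assumption; config, options, inpath, columns, and delim are the names from the question's own code:

// Sketch (assumption): set the write buffer after building the table
// configuration and before constructing the write transform.
CloudBigtableTableConfiguration config = CloudBigtableTableConfiguration.fromCBTOptions(options);

// A larger client write buffer lets the HBase client batch more mutations
// into each RPC instead of issuing many small requests. Value from step 2.
config.toHBaseConfig().set("hbase.client.write.buffer", "200971520");

Pipeline p = Pipeline.create(options);
CloudBigtableIO.initializeForWrite(p);

p.apply(TextIO.Read.from(inpath))
    .apply(ParDo.of(new CreatePutsFn(columns, delim)))
    .apply(CloudBigtableIO.writeToTable(config));

p.run();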