Question

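We are trying to run a daily Dataflow pipeline that reads from Bigtable and dumps the data into GCS (using HBase's Scan and HBaseResultCoder as the coder), as follows (simplified to highlight the idea):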

  Pipeline pipeline = Pipeline.create(options);
  Scan scan = new Scan();
  scan.setCacheBlocks(false).setMaxVersions(1);
  scan.addFamily(Bytes.toBytes("f"));
  CloudBigtableScanConfiguration btConfig = new CloudBigtableScanConfiguration.Builder()
      .withProjectId("aaa").withInstanceId("bbb").withTableId("ccc").withScan(scan).build();
  pipeline.apply(Read.from(CloudBigtableIO.read(btConfig)))
      .apply(TextIO.Write.to("gs://bucket/dir/file").withCoder(HBaseResultCoder.getInstance()));
  pipeline.run();

This seems to work exactly as expected.

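Now we want to be able to use the dump files in GCS for a recovery job, if one is ever needed. That is, we want a Dataflow pipeline that reads the dumped data from GCS (i.e. a PCollection<Result>) and creates Mutations (basically 'Put' objects). For some reason, the following code fails with a bunch of NullPointerExceptions. We are not sure why this happens. The if statements below were added to check for null or zero-length values to see whether that would make a difference, but it did not.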

// Part of DoFn<Result,Mutation>
@Override
public void processElement(ProcessContext c) {
  Result result = c.element();
  byte[] row = result.getRow();
  if (row == null || row.length == 0) { // NullPointerException at this line
    return;
  }
  Put mutation = new Put(result.getRow());
  // go through the column/value entries of this row, and create a corresponding put mutation.
  for (Entry<byte[], byte[]> entry : result.getFamilyMap(Bytes.toBytes(cf)).entrySet()) {
    byte[] qualifier = entry.getKey();
    if (qualifier == null || qualifier.length == 0) {
      continue;
    }
    byte[] val = entry.getValue();
    if (val == null || val.length == 0) {
      continue;
    }
    mutation.addImmutable(cf_bytes, qualifier, entry.getValue());
  }
  c.output(mutation);
}

The error we get is as follows (line 83 is marked above):

(2a6ad6372944050d): java.lang.NullPointerException at some.package.RecoveryFromGcs$CreateMutationFromResult.processElement(RecoveryFromGcs.java:83)

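I have two questions:
1. Has anyone experienced something similar when applying a ParDo to a PCollection<Result> to get a PCollection<Mutation> that is to be written to Bigtable?
2. Is this a reasonable approach? The end goal is to keep regular daily snapshots of our Bigtable (for a specific column family) as backups in case something bad happens. We want to be able to read the backup data through Dataflow and write it back to Bigtable when needed.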
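For the daily snapshots described above, one option is to give each day's dump its own output prefix so that a restore job can pick a specific day. The following is only a sketch: the `SnapshotPaths` class, the `prefixFor` helper, and the `<base>/<yyyy-MM-dd>/part` layout are hypothetical and not part of the Dataflow or Bigtable APIs.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class SnapshotPaths {

    // Hypothetical layout: <base>/<yyyy-MM-dd>/part
    // The returned string could be passed to the text sink as the output prefix.
    static String prefixFor(String base, LocalDate day) {
        // Drop a single trailing slash so the result never contains "//" after the base.
        String trimmed = base.endsWith("/") ? base.substring(0, base.length() - 1) : base;
        return trimmed + "/" + day.format(DateTimeFormatter.ISO_LOCAL_DATE) + "/part";
    }
}
```

The restore job would then read from the prefix of the day it wants to recover instead of a fixed path.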

Any suggestions and help would be greatly appreciated!

-------- EDIT

Here is the code that scans Bigtable and dumps the data to GCS (some details are hidden where they are not relevant):

public static void execute(Options options) {
  Pipeline pipeline = Pipeline.create(options);
  final String cf = "f"; // some specific column family.
  Scan scan = new Scan();
  scan.setCacheBlocks(false).setMaxVersions(1); // Disable caching and read only the latest cell.
  scan.addFamily(Bytes.toBytes(cf));

  CloudBigtableScanConfiguration btConfig =
      BigtableUtils.getCloudBigtableScanConfigurationBuilder(options.getProject(), "some-bigtable-name").withScan(scan).build();

  PCollection<Result> result = pipeline.apply(Read.from(CloudBigtableIO.read(btConfig)));

  PCollection<Mutation> mutation =
      result.apply(ParDo.of(new CreateMutationFromResult(cf))).setCoder(new HBaseMutationCoder());

  mutation.apply(TextIO.Write.to("gs://path-to-files").withCoder(new HBaseMutationCoder()));

  pipeline.run();
}

}

The job that reads the output of the above code is as follows (this is the one that throws an exception while reading from GCS):

public static void execute(Options options) {
  Pipeline pipeline = Pipeline.create(options);
  PCollection<Mutation> mutations = pipeline.apply(TextIO.Read
      .from("gs://path-to-files").withCoder(new HBaseMutationCoder()));

  CloudBigtableScanConfiguration config =
      BigtableUtils.getCloudBigtableScanConfigurationBuilder(options.getProject(), btTarget).build();
  if (config != null) {
    CloudBigtableIO.initializeForWrite(pipeline);
    mutations.apply(CloudBigtableIO.writeToTable(config));
  }
  pipeline.run();
}

}

The error I get (https://jpst.it/Qr6M) is a bit confusing, because the mutations are all Put objects, yet the error is about a 'Delete' object.
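One possible cause worth ruling out, stated here as an assumption rather than anything confirmed in this thread: TextIO stores one element per line, so a binary-encoded record that happens to contain a newline byte (0x0A) could be cut into pieces when it is read back, and a truncated piece might then decode as something unexpected. The sketch below uses plain byte arrays instead of the real coders to show the effect. `LineDelimitedBinaryDemo` and its helpers are hypothetical, and Base64 is just one way to make a record line-safe.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

class LineDelimitedBinaryDemo {

    // Lays records out one per line, the way a line-oriented text sink does.
    static byte[] writeLines(List<byte[]> records) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] record : records) {
            out.write(record, 0, record.length);
            out.write('\n');
        }
        return out.toByteArray();
    }

    // Splits the "file" back into records on every newline byte.
    static List<byte[]> readLines(byte[] file) {
        List<byte[]> lines = new ArrayList<>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        for (byte b : file) {
            if (b == '\n') {
                lines.add(current.toByteArray());
                current.reset();
            } else {
                current.write(b);
            }
        }
        return lines;
    }

    // Hypothetical workaround: the basic Base64 encoder never emits '\n',
    // so the encoded record always stays on a single line.
    static byte[] toSafeLine(byte[] record) {
        return Base64.getEncoder().encode(record);
    }

    static byte[] fromSafeLine(byte[] line) {
        return Base64.getDecoder().decode(line);
    }
}
```

With a record such as `{0x01, 0x0A, 0x02}`, the raw round trip yields two fragments instead of one record, while the Base64 round trip returns the original bytes. If this turns out to be the problem, a binary container format would be a more natural fit than text files.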

Answer 1

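It would be best to discuss this on the cloud bigtable client GitHub issues page. We are currently working on import/export features like this, so we will respond quickly. Even if you do not file a GitHub issue, we will explore this approach ourselves. A GitHub issue would simply let us communicate better.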

FWIW, I don't see how you could get an NPE on the line you highlighted. Are you sure you have the right line?
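The doubt above can be illustrated without HBase. In `row == null || row.length == 0`, the `||` operator short-circuits, so `row.length` is never evaluated for a null row and the check itself cannot throw. An NPE reported near that spot would more plausibly come from the preceding `result.getRow()` call on a null element, or from line numbers that do not match the deployed code. This is a sketch under that assumption: `FakeResult` is a hypothetical stand-in for HBase's `Result`, and `classify` is an illustrative helper.

```java
class NpeLineCheck {

    // Hypothetical stand-in for org.apache.hadoop.hbase.client.Result.
    static final class FakeResult {
        private final byte[] row;

        FakeResult(byte[] row) {
            this.row = row;
        }

        byte[] getRow() {
            return row;
        }
    }

    // Same guard as in the question. Because || short-circuits,
    // a null row never reaches row.length.
    static boolean isEmptyRow(byte[] row) {
        return row == null || row.length == 0;
    }

    // Checks the element itself first: calling getRow() on a null
    // element is where an NPE could actually be thrown.
    static String classify(FakeResult result) {
        if (result == null) {
            return "null element";
        }
        return isEmptyRow(result.getRow()) ? "empty row" : "ok";
    }
}
```

Logging the outcome of such a check for each element is one way to confirm whether null elements are arriving from the read step at all.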

EDIT (12/12):

The following processElement() method should work for converting a Result into a Put:

@Override
public void processElement(DoFn<Result, Mutation>.ProcessContext c) throws Exception {
  Result result = c.element();
  byte[] row = result.getRow();
  if (row != null && row.length > 0) {
    Put put = new Put(row);
    for (Cell cell : result.rawCells()) {
      put.add(cell);
    }
    c.output(put);
  }
}

From Bigtable to GCS (and vice versa) via Dataflow
