I want to run a Dataflow batch job in GCP project A. The pipeline's source is a Datastore in another project (project B). The pipeline works with DirectPipelineRunner, but when I switch to DataflowPipelineRunner I get this error: Request failed with code 403, will not retry: https://www.googleapis.com/datastore/v1beta2/datasets/projectb/runQuery. What is the correct way to do this?
I have added project A's service account to project B. I have also set the pipeline options' GCP credential from the service account's P12 certificate.
Pipeline code:
import com.google.api.client.util.SecurityUtils;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.api.services.datastore.DatastoreV1;
import com.google.api.services.datastore.client.DatastoreHelper;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.DatastoreIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.common.base.Strings;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

public class Sample {

    private static final Logger log = Logger.getLogger(Sample.class.getName());

    public static void main(String[] args) throws Exception {
        DataflowPipelineOptions options = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
        //options.setRunner(DirectPipelineRunner.class);
        options.setRunner(DataflowPipelineRunner.class);
        options.setProject("project_a");
        // Your Google Cloud Storage path for staging local files.
        options.setStagingLocation("gs://project_a_folder/staging");
        // Explicit service-account credential built from a P12 key on the classpath.
        options.setGcpCredential(
            DatastoreHelper.getServiceAccountCredential(
                "project_a@developer.gserviceaccount.com",
                SecurityUtils.loadPrivateKeyFromKeyStore(
                    SecurityUtils.getPkcs12KeyStore(),
                    Sample.class.getResourceAsStream("/projecta-0450c49cbddc.p12"),
                    "notasecret",
                    "privatekey",
                    "notasecret"),
                Arrays.asList(
                    "https://www.googleapis.com/auth/cloud-platform",
                    "https://www.googleapis.com/auth/devstorage.full_control",
                    "https://www.googleapis.com/auth/userinfo.email",
                    "https://www.googleapis.com/auth/datastore")));

        Pipeline pipeline = Pipeline.create(options);

        // Query the "Entity" kind in project B's Datastore.
        DatastoreV1.Query.Builder q = DatastoreV1.Query.newBuilder();
        q.addKindBuilder().setName("Entity");
        q.setFilter(DatastoreHelper.makeFilter("property",
            DatastoreV1.PropertyFilter.Operator.EQUAL,
            DatastoreHelper.makeValue("somevalue")));

        // Output schema for the BigQuery table.
        List<TableFieldSchema> fields = new ArrayList<>();
        fields.add(new TableFieldSchema().setName("f1").setType("STRING"));
        fields.add(new TableFieldSchema().setName("f2").setType("STRING"));
        TableSchema tableSchema = new TableSchema().setFields(fields);

        pipeline.apply(DatastoreIO.readFrom("projectb", q.build()))
            .apply(ParDo.of(new DoFn<DatastoreV1.Entity, KV<String, String>>() {
                @Override
                public void processElement(ProcessContext c) throws Exception {
                    try {
                        Map<String, DatastoreV1.Value> propertyMap = DatastoreHelper.getPropertyMap(c.element());
                        String p1 = DatastoreHelper.getString(propertyMap.get("p1"));
                        String p2 = DatastoreHelper.getString(propertyMap.get("p2"));
                        if (!Strings.isNullOrEmpty(p1)) {
                            c.output(KV.of(p2, p1));
                        }
                    } catch (Exception e) {
                        log.log(Level.SEVERE, "Failed to output entity data", e);
                    }
                }
            }))
            .apply(ParDo.of(new DoFn<KV<String, String>, TableRow>() {
                @Override
                public void processElement(ProcessContext c) throws Exception {
                    TableRow tableRow = new TableRow();
                    tableRow.set("f1", c.element().getKey());
                    tableRow.set("f2", c.element().getValue());
                    c.output(tableRow);
                }
            }))
            .apply(BigQueryIO.Write.to("dataset.table")
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
                .withSchema(tableSchema));

        pipeline.run();
    }
}
Answer 0 (score: 1)
Cross-project Datastore access from Dataflow should work if project A's service account has been added as an admin of project B and the Cloud Datastore API is enabled in both projects.
I don't think you need to do any manual credential handling; Dataflow should automatically run as project A's service account.
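If that IAM setup is in place, the setup from the question can be reduced to something like the sketch below. This is a minimal, untested sketch against the same Dataflow SDK 1.x / Datastore v1beta2 APIs the question uses; the class name CrossProjectRead is made up, and the project, bucket, and kind names are the placeholders from the question. The only substantive change is dropping the setGcpCredential(...) call, so the job authenticates as project A's default service account:

import com.google.api.services.datastore.DatastoreV1;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.DatastoreIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner;

public class CrossProjectRead {
    public static void main(String[] args) {
        // Same options as in the question, minus the manual P12 credential:
        // the job then runs as project A's default service account, which is
        // the account that must be granted access on project B.
        DataflowPipelineOptions options =
            PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
        options.setRunner(DataflowPipelineRunner.class);
        options.setProject("project_a");
        options.setStagingLocation("gs://project_a_folder/staging");

        Pipeline pipeline = Pipeline.create(options);

        // The read still names project B's dataset explicitly; authorization
        // comes from the IAM membership, not from a hand-built credential.
        DatastoreV1.Query.Builder q = DatastoreV1.Query.newBuilder();
        q.addKindBuilder().setName("Entity");
        pipeline.apply(DatastoreIO.readFrom("projectb", q.build()));

        pipeline.run();
    }
}

If the 403 persists with this setup, check in the Dataflow job details which account the workers actually run as, and confirm that exact account is listed in project B's permissions.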