Question

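I am new to Apache Beam. Per our requirements, I need to pass in a JSON file containing 5 to 10 JSON records as input, read that JSON data from the file line by line, and store it in BigQuery. Can anyone help me with the sample code below, which tries to read JSON data using Apache Beam?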

PCollection<String> lines = 
    pipeline
      .apply("ReadMyFile", 
             TextIO.read()
                   .from("C:\\Users\\Desktop\\test.json")); 
if(null!=lines) { 
  PCollection<String> words =
     lines.apply(ParDo.of(new DoFn<String, String>() { 
        @ProcessElement
        public void processElement(ProcessContext c) { 
          String line = c.element();
        }
      })); 
  pipeline.run(); 
}

Answer 1

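The answer is: it depends.

TextIO reads a file line by line, so in your test.json each line needs to contain one separate JSON object.

The ParDo you have will then receive those lines one at a time; that is, each call to @ProcessElement gets a single line.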
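To see why the one-object-per-line layout matters, here is a minimal plain-Java sketch that runs on the JDK alone, with no Beam involved. The class and method names (LineSplitDemo, splitIntoRecords, looksLikeWholeObject) are made up for illustration, and the shape check is a crude heuristic, not real JSON validation. It only shows what a line-based reader would hand to the next step for two different file layouts.

```java
import java.util.List;
import java.util.stream.Collectors;

class LineSplitDemo {
    // Splits text on line breaks, which is roughly what a line-based
    // reader does before handing each record to the next step.
    static List<String> splitIntoRecords(String fileContent) {
        return fileContent.lines().collect(Collectors.toList());
    }

    // Crude shape check (an assumption for this demo, not JSON validation):
    // does the line start with '{' and end with '}'?
    static boolean looksLikeWholeObject(String line) {
        String trimmed = line.trim();
        return trimmed.startsWith("{") && trimmed.endsWith("}");
    }

    public static void main(String[] args) {
        String oneObjectPerLine =
            "{\"col1\":\"sample-val-1\", \"col2\":1.0}\n"
          + "{\"col1\":\"sample-val-2\", \"col2\":2.0}\n";
        String prettyPrinted =
            "{\n  \"col1\": \"sample-val-1\",\n  \"col2\": 1.0\n}\n";

        // Every record of the first layout is a complete object.
        for (String record : splitIntoRecords(oneObjectPerLine)) {
            System.out.println(looksLikeWholeObject(record) + " <- " + record);
        }
        // The pretty-printed layout is cut into fragments that cannot be parsed alone.
        for (String record : splitIntoRecords(prettyPrinted)) {
            System.out.println(looksLikeWholeObject(record) + " <- " + record);
        }
    }
}
```

If your file is pretty-printed, convert it to one object per line before feeding it to a line-based reader.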

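Then, in your ParDo, you can use something like Jackson's ObjectMapper to parse the JSON from the line (or any other JSON parser you are familiar with, but Jackson is widely used, including in a few places in Beam itself).

The overall approach to writing a ParDo is:

Get c.element();
Do something with the value of c.element(), e.g. parse it from JSON into a Java object;
Send the result of what you did with c.element() to c.output().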
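The three steps above can be sketched without a pipeline by writing the per-element logic as a plain Java function, which also makes it easy to unit-test. This is only an illustration under stated assumptions: PerElementLogic is a hypothetical name, the Consumer stands in for c.output(), and the "do something" step is just trimming and skipping blank lines rather than real JSON parsing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class PerElementLogic {
    // Inside a DoFn, `element` would come from c.element() and
    // `output.accept(...)` would be c.output(...).
    static void processElement(String element, Consumer<String> output) {
        // Step 1: get the element.
        String line = element;
        // Step 2: do something with it (here: strip surrounding whitespace).
        String cleaned = line.trim();
        if (cleaned.isEmpty()) {
            return; // emitting nothing for an element is allowed
        }
        // Step 3: send the result downstream.
        output.accept(cleaned);
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        for (String line : List.of("  {\"col1\":\"a\"}  ", "", "{\"col1\":\"b\"}")) {
            processElement(line, emitted::add);
        }
        System.out.println(emitted);
    }
}
```

Keeping the logic in a static method like this means the DoFn body shrinks to a single call, and the parsing rules can be tested on their own.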

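I would recommend starting with the Jackson extension of the Beam SDK, which adds PTransforms that do exactly that; see this and this.

Also take a look at this post, which has some links.

There is also the JsonToRow transform, where you can look for similar logic; the difference is that it parses the JSON into Beam's Row class rather than into a user-defined Java object.

Before writing to BigQuery, you need to convert the objects you parsed from JSON into BigQuery rows, which will be another ParDo after your parsing logic, and then actually apply BigQueryIO. You can see some examples in the BigQuery tests.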

Answer 2

Suppose the file contains JSON strings like the following, one object per line:

{"col1":"sample-val-1", "col2":1.0}
{"col1":"sample-val-2", "col2":2.0}
{"col1":"sample-val-3", "col2":3.0}
{"col1":"sample-val-4", "col2":4.0}
{"col1":"sample-val-5", "col2":5.0}

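To store these values from the file into BigQuery via Dataflow / Beam, you would follow these steps:

Define a TableReference that refers to the BigQuery table.
Define a TableFieldSchema for each column you want to store.
Read the file using TextIO.read().
Create a DoFn that parses the JSON string into a TableRow.
Commit the TableRow objects using BigQueryIO.

The following code snippets cover the steps above.

To create the TableReference and the TableFieldSchema: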

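import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableReference;
import java.util.ArrayList;
import java.util.List;

// Points at the destination table; replace the placeholders with your own IDs.
TableReference tableRef = new TableReference();
tableRef.setProjectId("project-id");
tableRef.setDatasetId("dataset-name");
tableRef.setTableId("table-name");

// One TableFieldSchema per column of the destination table.
List<TableFieldSchema> fieldDefs = new ArrayList<>();
fieldDefs.add(new TableFieldSchema().setName("column1").setType("STRING"));
fieldDefs.add(new TableFieldSchema().setName("column2").setType("FLOAT"));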

For the pipeline steps:

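import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import java.util.HashMap;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

Pipeline pipeLine = Pipeline.create(options);
pipeLine
// Emits one String per line of the input file.
.apply("ReadMyFile",
        TextIO.read().from("path-to-json-file"))

// Parses each line and maps the JSON keys (col1, col2) to the table columns.
.apply("MapToTableRow", ParDo.of(new DoFn<String, TableRow>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        Gson gson = new GsonBuilder().create();
        // c.element() is already a String, so no toString() is needed.
        HashMap<String, Object> parsedMap = gson.fromJson(c.element(), HashMap.class);

        TableRow row = new TableRow();
        row.set("column1", parsedMap.get("col1").toString());
        row.set("column2", Double.parseDouble(parsedMap.get("col2").toString()));
        c.output(row);
    }
}))

// Creates the table if it does not exist and appends the rows to it.
.apply("CommitToBQTable", BigQueryIO.writeTableRows()
        .to(tableRef)
        .withSchema(new TableSchema().setFields(fieldDefs))
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND));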

pipeLine.run();
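Note that the mapping inside the DoFn throws a NullPointerException if a line lacks col1 or col2. A minimal sketch of the same mapping with an explicit check is shown below, written against the JDK only so that it can be run and tested without Beam or Gson. RowMapping and toRow are hypothetical names, and a LinkedHashMap stands in for TableRow here; in the pipeline you would call row.set(...) on a real TableRow instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RowMapping {
    // Maps the parsed JSON keys (col1, col2) onto the table's column
    // names (column1, column2). The LinkedHashMap is a stand-in for TableRow.
    static Map<String, Object> toRow(Map<String, Object> parsed) {
        Object col1 = parsed.get("col1");
        Object col2 = parsed.get("col2");
        if (col1 == null || col2 == null) {
            // Fail with a readable message instead of a bare NullPointerException.
            throw new IllegalArgumentException("Record is missing col1 or col2: " + parsed);
        }
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("column1", col1.toString());
        row.put("column2", Double.parseDouble(col2.toString()));
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> parsed = new LinkedHashMap<>();
        parsed.put("col1", "sample-val-1");
        parsed.put("col2", 1.0);
        System.out.println(toRow(parsed));
    }
}
```

Whether a bad record should fail the pipeline or be skipped is a design choice; this sketch fails loudly so that malformed input does not go unnoticed.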

The resulting BigQuery table then has one row per JSON line, with the columns column1 (STRING) and column2 (FLOAT).
