Question

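We are currently looking for the best way to convert raw data into a common structure for further analysis. Our data consists of JSON files: some have more fields, some have fewer, and some may contain arrays, but overall the structure is roughly the same.

To that end, I am trying to build an Apache Beam pipeline in Java. All my pipelines are based on this template: TextIOToBigQuery.java

The first approach is to load the entire JSON as a string into a single column and then use JSON Functions in Standard SQL to transform it into the common structure. This is described well here: How to manage/handle schema changes while loading JSON file into BigQuery table

The second approach is to load the data into appropriate columns, so that it can be queried with standard SQL right away. This requires knowing the schema. The schema can be detected via the console, the UI and other tools (see Using schema auto-detection), but I have not found any information on how to achieve this from Java and an Apache Beam pipeline.

I analysed BigQueryIO, and it looks like it cannot work without a schema (with one exception: when the table has already been created).

As mentioned, new files may bring new fields, so the schema should be updated accordingly.

Suppose I have three JSON files: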

1. { "field1": "value1" }
2. { "field2": "value2" }
3. { "field1": "value3", "field10": "value10" }

The first one creates a new table with a single field "field1" of type string. So my table should look like this:

|field1  |
----------
|"value1"|

The second one does the same but adds a new field, "field2". Now my table should look like this:

|field1  |field2  |
-------------------
|"value1"|null    |
-------------------
|null    |"value2"|

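The third JSON should add another field, "field10", to the schema, and so on. Real JSON files may contain 200 fields or more. How hard is it to handle such a scenario?

Which approach is better for this kind of transformation?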
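To make the target concrete, here is a minimal plain-Java sketch of the behaviour described above: the schema is the union of all field names seen so far, and every row is aligned to that schema with null for missing fields. It involves neither Beam nor BigQuery; the class SparseTableSketch and its methods are hypothetical names used only for illustration, and the JSON files are represented as already-parsed maps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Beam or BigQuery.
final class SparseTableSketch {

  /** Union of all field names, in order of first appearance. */
  static List<String> unionOfFields(List<Map<String, String>> records) {
    Set<String> fields = new LinkedHashSet<>();
    for (Map<String, String> record : records) {
      fields.addAll(record.keySet());
    }
    return new ArrayList<>(fields);
  }

  /** One row per record, aligned to the schema; absent fields become null. */
  static List<List<String>> toRows(List<String> schema, List<Map<String, String>> records) {
    List<List<String>> rows = new ArrayList<>();
    for (Map<String, String> record : records) {
      List<String> row = new ArrayList<>();
      for (String field : schema) {
        row.add(record.get(field)); // null when the record lacks this field
      }
      rows.add(row);
    }
    return rows;
  }

  /** Builds a record from alternating key/value arguments. */
  static Map<String, String> record(String... keyValues) {
    Map<String, String> m = new LinkedHashMap<>();
    for (int i = 0; i < keyValues.length; i += 2) {
      m.put(keyValues[i], keyValues[i + 1]);
    }
    return m;
  }

  public static void main(String[] args) {
    List<Map<String, String>> records = Arrays.asList(
        record("field1", "value1"),
        record("field2", "value2"),
        record("field1", "value3", "field10", "value10"));

    List<String> schema = unionOfFields(records);
    System.out.println(schema); // [field1, field2, field10]
    for (List<String> row : toRows(schema, records)) {
      System.out.println(row);
    }
  }
}
```

The hard part in a real pipeline is not this merge itself but that the union is only known after all files have been seen, which is why the schema cannot simply be passed in at pipeline construction time.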

Answer 1

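I ran some tests that simulate a typical auto-detect pattern: first I make a pass over all the data to build a Map of every possible field and its type (for simplicity I only considered String or Integer here). I use a stateful pipeline to keep track of the fields seen so far and save the result as a PCollectionView. This way I can use .withSchemaFromView(), since the schema is unknown at pipeline construction time. Note that this approach is only valid for batch jobs.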
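Before the Beam code, the field/type map itself can be sketched in plain Java. This is only an illustration of the idea, not the pipeline code: FieldTypeSketch and its methods are hypothetical names, and records are assumed to be already-parsed string maps. inferFirstSeen keeps the type of the first value seen for a field; inferWidening is a variation of my own (an assumption, not something the pipeline does) that falls back to STRING as soon as any value fails to parse as an integer.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; no Beam involved.
final class FieldTypeSketch {

  /** "INTEGER" if the value parses as a 32-bit int, otherwise "STRING". */
  static String detectType(String value) {
    try {
      Integer.parseInt(value);
      return "INTEGER";
    } catch (NumberFormatException e) {
      return "STRING";
    }
  }

  /** First-seen-wins: a field keeps the type of the first value observed for it. */
  static Map<String, String> inferFirstSeen(List<Map<String, String>> records) {
    Map<String, String> schema = new LinkedHashMap<>();
    for (Map<String, String> record : records) {
      for (Map.Entry<String, String> e : record.entrySet()) {
        schema.putIfAbsent(e.getKey(), detectType(e.getValue()));
      }
    }
    return schema;
  }

  /** Variation (assumption): widen to STRING if any value is not an integer. */
  static Map<String, String> inferWidening(List<Map<String, String>> records) {
    Map<String, String> schema = new LinkedHashMap<>();
    for (Map<String, String> record : records) {
      for (Map.Entry<String, String> e : record.entrySet()) {
        schema.merge(e.getKey(), detectType(e.getValue()),
            (old, cur) -> old.equals("INTEGER") && cur.equals("INTEGER") ? "INTEGER" : "STRING");
      }
    }
    return schema;
  }

  /** Builds a record from alternating key/value arguments. */
  static Map<String, String> record(String... keyValues) {
    Map<String, String> m = new LinkedHashMap<>();
    for (int i = 0; i < keyValues.length; i += 2) {
      m.put(keyValues[i], keyValues[i + 1]);
    }
    return m;
  }

  public static void main(String[] args) {
    List<Map<String, String>> records = Arrays.asList(
        record("age", "22"),
        record("age", "unknown", "user", "Bob"));

    System.out.println(inferFirstSeen(records)); // {age=INTEGER, user=STRING}
    System.out.println(inferWidening(records));  // {age=STRING, user=STRING}
  }
}
```

With first-seen-wins, the inferred type depends on which value happens to be processed first, so a column can end up as INTEGER even though later rows hold non-numeric text. Whether that matters depends on the data; the widening variant trades precision for safety.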

First, I create some dummy data without a strict schema, where each row may or may not contain any given field:

PCollection<KV<Integer, String>> input = p
  .apply("Create data", Create.of(
        KV.of(1, "{\"user\":\"Alice\",\"age\":\"22\",\"country\":\"Denmark\"}"),
        KV.of(1, "{\"income\":\"1500\",\"blood\":\"A+\"}"),
        KV.of(1, "{\"food\":\"pineapple pizza\",\"age\":\"44\"}"),
        KV.of(1, "{\"user\":\"Bob\",\"movie\":\"Inception\",\"income\":\"1350\"}"))
  );

We read the input data and build a Map of the different field names seen in the data, with a basic type check to determine whether each one contains an INTEGER or a STRING. Of course, this could be extended if needed. Note that all the data created before was assigned to the same key, so that it is grouped together and we can build the complete list of fields, but this can be a performance bottleneck. We materialize the output so that it can be used as a side input:

PCollectionView<Map<String, String>> schemaSideInput = input
  .apply("Build schema", ParDo.of(new DoFn<KV<Integer, String>, KV<String, String>>() {

    // A map containing field-type pairs
    @StateId("schema")
    private final StateSpec<ValueState<Map<String, String>>> schemaSpec =
        StateSpecs.value(MapCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()));

    @ProcessElement
    public void processElement(ProcessContext c,
                               @StateId("schema") ValueState<Map<String, String>> schemaSpec) {
      JSONObject message = new JSONObject(c.element().getValue());
      Map<String, String> current = firstNonNull(schemaSpec.read(), new HashMap<String, String>());

      // iterate through fields
      message.keySet().forEach(key -> {
        Object value = message.get(key);

        // only fields not seen before are emitted
        if (!current.containsKey(key)) {
          String type = "STRING";

          try {
            Integer.parseInt(value.toString());
            type = "INTEGER";
          } catch (Exception e) {
            // not an integer: keep STRING
          }

          // uncomment if debugging is needed
          // LOG.info("key: "+ key + " value: " + value + " type: " + type);

          c.output(KV.of(key, type));
          current.put(key, type);
          schemaSpec.write(current);
        }
      });
    }
  })).apply("Save as Map", View.asMap());

Now we can use the previous Map to build the PCollectionView containing the BigQuery table schema:

PCollectionView<Map<String, String>> schemaView = p
  .apply("Start", Create.of("Start"))
  .apply("Create Schema", ParDo.of(new DoFn<String, Map<String, String>>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
      Map<String, String> schemaFields = c.sideInput(schemaSideInput);
      List<TableFieldSchema> fields = new ArrayList<>();

      for (Map.Entry<String, String> field : schemaFields.entrySet()) {
        fields.add(new TableFieldSchema().setName(field.getKey()).setType(field.getValue()));
        // LOG.info("key: "+ field.getKey() + " type: " + field.getValue());
      }

      TableSchema schema = new TableSchema().setFields(fields);

      String jsonSchema;
      try {
        jsonSchema = Transport.getJsonFactory().toString(schema);
      } catch (IOException e) {
        throw new RuntimeException(e);
      }

      // table spec -> JSON-encoded table schema
      c.output(ImmutableMap.of("PROJECT_ID:DATASET_NAME.dynamic_bq_schema", jsonSchema));
    }}).withSideInputs(schemaSideInput))
  .apply("Save as Singleton", View.asSingleton());

Change the fully qualified table name PROJECT_ID:DATASET_NAME.dynamic_bq_schema accordingly.
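The schema view is a map from the table spec to a JSON-encoded table schema. As a rough illustration of what that JSON string looks like, here is a hand-rolled plain-Java sketch that serializes a field/type map. SchemaJsonSketch is a hypothetical name, and this is not how the pipeline builds the string (it uses Transport.getJsonFactory()); the key order and whitespace produced by the real library may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration only; the pipeline itself relies on
// Transport.getJsonFactory().toString(schema) instead.
final class SchemaJsonSketch {

  /** Serializes field name -> type pairs as {"fields":[{"name":...,"type":...},...]}. */
  static String toSchemaJson(Map<String, String> fieldTypes) {
    StringJoiner fields = new StringJoiner(",", "{\"fields\":[", "]}");
    for (Map.Entry<String, String> e : fieldTypes.entrySet()) {
      fields.add("{\"name\":\"" + escape(e.getKey())
          + "\",\"type\":\"" + escape(e.getValue()) + "\"}");
    }
    return fields.toString();
  }

  /** Minimal JSON string escaping: backslashes and double quotes only. */
  static String escape(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"");
  }

  public static void main(String[] args) {
    Map<String, String> fieldTypes = new LinkedHashMap<>();
    fieldTypes.put("age", "INTEGER");
    fieldTypes.put("user", "STRING");
    System.out.println(toSchemaJson(fieldTypes));
  }
}
```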

Finally, in our pipeline we read the data, convert it to TableRow, and write it to BigQuery using .withSchemaFromView(schemaView).

Full code here.

BigQuery table schema created by the pipeline:

And the resulting sparse data:

Answer 2

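If your data is serialized according to a schema (Avro, Protobuf, etc.), you can create/update the table schema in a streaming job. In that sense the schema is predefined, but the table schema is still updated as part of processing.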

Writing to BigQuery from Dataflow jobs using schema auto-detection
