I am working on a Beam pipeline that processes a JSON document and writes it to BigQuery. The JSON looks like this.
{
"message": [{
"name": "abc",
"itemId": "2123",
"itemName": "test"
}, {
"name": "vfg",
"itemId": "56457",
"itemName": "Chicken"
}],
"publishDate": "2017-10-26T04:54:16.207Z"
}
I use Jackson to parse it into the following structure.
class Feed{
List<Message> messages;
TimeStamp publishDate;
}
public class Message implements Serializable{
/**
*
*/
private static final long serialVersionUID = 1L;
private String key;
private String value;
private Map<String, String> eventItemMap = new HashMap<>();
This property merges the list of maps into a single map holding all the key-value pairs. The messages property is parsed as a list of HashMap objects, one per key/value pair, and these are then combined into one map.
Now, in my pipeline, I convert the collection to
PCollection<KV<String, Feed>>
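so that it can be written to different tables depending on a property in the class. I have written a transform to do this. The requirement is to create multiple TableRows based on the number of message objects. The JSON has a few more properties besides publishDate, and these should be added to each TableRow together with that message's properties. So the table would look as follows.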
id, name, field1, field2, message1.property1, message1.property2...
id, name, field1, field2, message2.property1, message2.property2...
I tried to create the transform below. However, I am not sure how it can output multiple rows based on the message list.
private class BuildRowListFn extends DoFn<KV<String, Feed>, List<TableRow>> {
@ProcessElement
public void processElement(ProcessContext context) {
Feed feed = context.element().getValue();
List<Message> messages = feed.getMessage();
List<TableRow> rows = new ArrayList<>();
messages.forEach((message) -> {
TableRow row = new TableRow();
row.set("column1", feed.getPublishDate());
row.set("column2", message.getEventItemMap().get("key1"));
row.set("column3", message.getEventItemMap().get("key2"));
rows.add(row);
}
);
}
However, this output is also a List, so I cannot apply the BigQueryIO.write transform to it.
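Thanks @jkff. I have now changed the code as described in your second paragraph: context.output(row) is called inside messages.forEach, after the table row has been populated.
List<Message> messages = feed.getMessage();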
messages.forEach((message) -> {
TableRow row = new TableRow();
row.set("column2", message.getEventItemMap().get("key1"));
context.output(row);
});
Now, when I try to write this collection to BigQuery as follows,
rows.apply(BigQueryIO.writeTableRows().to(getTable(projectId, datasetId, tableName)).withSchema(getSchema())
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(WriteDisposition.WRITE_APPEND));
I get the following exception.
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.NullPointerException
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:331)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:301)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:200)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:63)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:283)
at com.chefd.gcloud.analytics.pipeline.MyPipeline.main(MyPipeline.java:284)
Caused by: java.lang.NullPointerException
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:759)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:809)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.flushRows(StreamingWriteFn.java:126)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.finishBundle(StreamingWriteFn.java:96)
Please help.
Thanks.
Answer 0 (score: 1)
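You seem to be assuming that a DoFn can output only one value per element. That is not the case: it can output any number of values per element - no values, one value, many values, and so on. A DoFn can even output values to multiple PCollections.
In your case, you simply need to call c.output(row) for every row in your @ProcessElement method, for example: rows.forEach(c::output). Of course, you will also need to change the type of your DoFn to DoFn<KV<String, Feed>, TableRow>, because the type of the elements in its output PCollection is TableRow, not List<TableRow> - you are just producing multiple elements into the collection for every input element, but that does not change the element type.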
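The fan-out described above can be sketched without the Beam SDK. This is only an illustration under stated assumptions: `FanOutSketch`, its simplified `Feed`, and `buildRows` are hypothetical names, a `Map<String, Object>` stands in for `TableRow`, and the `Consumer` passed as `out` plays the role of `c.output(row)`. The column and key names are the ones used in the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class FanOutSketch {

    // Hypothetical, simplified stand-in for the question's Feed class:
    // one publish date plus one key/value map per message.
    static final class Feed {
        final String publishDate;
        final List<Map<String, String>> messages;

        Feed(String publishDate, List<Map<String, String>> messages) {
            this.publishDate = publishDate;
            this.messages = messages;
        }
    }

    // Emits one row per message. The shared publishDate is copied into every
    // row, and each row is handed to `out` directly instead of being collected
    // into a list. In a Beam DoFn, `out.accept(row)` would be `c.output(row)`.
    static void buildRows(Feed feed, Consumer<Map<String, Object>> out) {
        for (Map<String, String> message : feed.messages) {
            Map<String, Object> row = new LinkedHashMap<>();
            row.put("column1", feed.publishDate);
            row.put("column2", message.get("key1"));
            row.put("column3", message.get("key2"));
            out.accept(row);
        }
    }

    public static void main(String[] args) {
        Feed feed = new Feed("2017-10-26T04:54:16.207Z", List.of(
                Map.of("key1", "abc", "key2", "2123"),
                Map.of("key1", "vfg", "key2", "56457")));

        // One input element, two emitted rows.
        List<Map<String, Object>> rows = new ArrayList<>();
        buildRows(feed, rows::add);
        rows.forEach(System.out::println);
    }
}
```

Because every row is emitted individually, the output element type is the row itself rather than a list of rows, which is the shape that BigQueryIO.writeTableRows() is applied to in the question.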
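An alternative is to keep what you currently have, also call c.output(rows), and then apply Flatten.iterables() to flatten the PCollection<List<TableRow>> into a PCollection<TableRow> (you may need to replace List with Iterable to get it to work). But the other approach is easier.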
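What that flattening step does to the data can be shown with a plain-Java analogue. This is not Beam code: `FlattenSketch` and `flatten` are hypothetical names, and `Stream.flatMap` merely stands in for the effect of Flatten.iterables() on a collection whose elements are themselves lists.

```java
import java.util.List;
import java.util.stream.Collectors;

public class FlattenSketch {

    // Turns a collection of lists into a collection of the individual items.
    // An empty inner list contributes nothing to the result.
    static <T> List<T> flatten(List<List<T>> nested) {
        return nested.stream()
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Each inner list represents the rows built for one input element.
        List<List<String>> perElement = List.of(
                List.of("row-a1", "row-a2"),
                List.<String>of(),
                List.of("row-c1"));

        // Order is kept here only because Java lists are ordered.
        List<String> rows = flatten(perElement);
        rows.forEach(System.out::println);
    }
}
```

The result has three items from three inner lists of sizes two, zero, and one, which mirrors the point that an input element may yield many output elements or none.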