I have an Apache Beam pipeline that processes streaming data from a Google Pub/Sub topic and writes it to Google Datastore. For the past few days it has been failing with the error below, which blocks the pipeline and causes us to lose data.
com.google.datastore.v1.client.DatastoreException: A non-transactional commit may not contain multiple mutations affecting the same entity., code=INVALID_ARGUMENT
at com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:126)
at com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:169)
at com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:89)
at com.google.datastore.v1.client.Datastore.commit(Datastore.java:84)
at org.apache.beam.sdk.io.gcp.datastore.DatastoreV1$DatastoreWriterFn.flushBatch(DatastoreV1.java:1326)
at org.apache.beam.sdk.io.gcp.datastore.DatastoreV1$DatastoreWriterFn.finishBundle(DatastoreV1.java:1291)
The pipeline runs in streaming mode and does not batch or window the data in any way, so I don't see how it could be writing duplicate records at the same time. I'd like to check what is going on.
The Beam pipeline code is as follows:
public class JobPipeline {
  private final Pipeline pipeline;
  private final JobOptions options;

  JobPipeline(JobOptions options) {
    this.options = options;
    this.pipeline = Pipeline.create(options);
  }

  void run() throws IOException {
    PTransform<PBegin, PCollection<String>> input = getInput([pubsub topic]);
    PCollection<KV<String, EnrichedData>> enrichedData =
        new EnrichmentPipeline(options, input).apply(pipeline);
    pipeline.run();
  }
}
public class EnrichmentPipeline {
  private final JobOptions options;
  private final PTransform<PBegin, PCollection<String>> input;

  public EnrichmentPipeline(JobOptions options,
      PTransform<PBegin, PCollection<String>> input) {
    this.options = options;
    this.input = input;
  }

  public PCollection<KV<String, EnrichedData>> apply(final Pipeline pipeline) throws IOException {
    PCollection<KV<String, EnrichedData>> enrichedData = pipeline.apply("Reading Data", input)
        .apply("Transforming Json to Data", ParDo.of(new JsonToData()))
        .apply("Enrichment", ParDo.of(new EnrichmentFn(options.getProjectId(), options.getReferenceKind())));
    writeIntoDataStore(options.getProjectId(), enrichedData, new EnrichedDataToEntityFn(options.getDataKind()));
    return enrichedData;
  }
}
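For context on what the error means: Datastore rejects a non-transactional commit when the batch contains two or more mutations targeting the same entity key, and Beam's DatastoreWriterFn batches mutations per bundle before committing (the flushBatch frame in the trace). So if two elements in the same bundle map to the same entity key, the commit fails even without explicit batching in my pipeline. A minimal sketch of the dedup idea in plain Java (not Beam or Datastore APIs; the Mutation record and dedupByKey are hypothetical stand-ins), collapsing a batch so each key appears at most once, last write wins:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupBatch {
  // Hypothetical stand-in for a Datastore mutation: an entity key plus a payload.
  record Mutation(String entityKey, String payload) {}

  // Collapse a batch so each entity key appears at most once,
  // keeping the last mutation seen for that key (last-write-wins).
  static List<Mutation> dedupByKey(List<Mutation> batch) {
    Map<String, Mutation> byKey = new LinkedHashMap<>();
    for (Mutation m : batch) {
      byKey.put(m.entityKey(), m); // a later mutation replaces an earlier one for the same key
    }
    return new ArrayList<>(byKey.values());
  }

  public static void main(String[] args) {
    List<Mutation> batch = List.of(
        new Mutation("key-1", "v1"),
        new Mutation("key-2", "v1"),
        new Mutation("key-1", "v2")); // duplicate key -> would break the commit
    List<Mutation> deduped = dedupByKey(batch);
    System.out.println(deduped.size());           // 2
    System.out.println(deduped.get(0).payload()); // v2 (last write for key-1)
  }
}
```

In the Beam pipeline itself, the equivalent would presumably be keying enrichedData by the entity key and keeping one element per key (for example with a windowed GroupByKey or Distinct step) before writeIntoDataStore, though whether that is acceptable depends on which mutation should win.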