GCP Dataflow pipeline is not reading/processing JSON lines

Time: 2016-11-07 09:55:32

Tags: json google-cloud-platform google-cloud-dataflow

Based on the WordCount example, I am trying to read my own JSON data (instead of the Shakespeare texts).

I am running the pipeline with:

mvn compile exec:java  -Dexec.mainClass=myPkg.myClass  -Dexec.args=" \
 --project=myProj \
 --stagingLocation=gs://myBkt/stage \
 --runner=BlockingDataflowPipelineRunner \
 --output=gs://myBkt/output/out \
 --defaultWorkerLogLevel=DEBUG"

The console output is as follows:

<date> com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner fromOptions
INFO: PipelineOptions.filesToStage was not specified. Defaulting to files from the classpath: will stage 68 files. Enable logging at DEBUG level to see which files will be staged.
<date> myPkg$GroupPublished apply
<date> myPkg$GroupPublished apply
INFO: GroupPublished/JsonToDatePosPlatKeyFn.out [PCollection]
<date> myPkg main

Main

static void main(String[] args) {
    ...

    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.Read.named("ReadJson").from(options.getInputFile())) 
        .apply(new GroupPublished())
        .apply(ParDo.of(new FormatAsStringFn()))
        .apply(TextIO.Write.named("WriteCounts").to(options.getOutput()));
}

GroupPublished transformation

static class GroupPublished extends PTransform<PCollection<String>,
        PCollection<KV<DatePosPlatKey, Long>>> {
    @Override
    public PCollection<KV<DatePosPlatKey, Long>> apply(PCollection<String> lines) {
        PCollection<DatePosPlatKey> keyList
                = lines.apply(ParDo.of(new JsonToDatePosPlatKeyFn()));

        PCollection<KV<DatePosPlatKey, Long>> keysCounted =
                keyList.apply(Count.<DatePosPlatKey>perElement());

        return keysCounted;
    }
}

JSON line processing

static class JsonToDatePosPlatKeyFn extends DoFn<String, DatePosPlatKey>{
    @Override
    public void processElement(ProcessContext c) throws Exception {
        JsonNode root = mapper.readTree(c.element());
        for (JsonNode jsonFact : root) {
            DatePosPlatKey key = new DatePosPlatKey(...construct...);
            ...manipulate...
            c.output(key);
        }
    }
}

Data class

@DefaultCoder(AvroCoder.class)
public static class DatePosPlatKey { ... }

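What I have checked so far:

  • Adding defaultWorkerLogLevel does not seem to have any effect on the console output
  • Renaming the JSON file triggers an error, so I know TextIO sees the file
  • The JSON file is formatted as: {...}\n{...}\n...
  • No logs or Dataflow jobs appear in the Google Cloud Console

How can I better debug a complete absence of data? Can you see what I am doing wrong?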

1 Answer:

Answer 0 (score: 0)

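After an offline discussion, it turned out that the code was missing a call to p.run(), so the pipeline was only constructed but never executed.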
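
In terms of the main method in the question, the fix amounts to calling p.run(); after the last apply. To illustrate why nothing happened without it, here is a minimal sketch of deferred execution. MiniPipeline is a hypothetical stand-in written for this example only, not the Dataflow SDK: apply merely records a step, and the steps run only when run() is called.

```java
import java.util.ArrayList;
import java.util.List;

class DeferredPipelineSketch {

    // Hypothetical stand-in for a deferred-execution pipeline; NOT the Dataflow SDK.
    static class MiniPipeline {
        private final List<Runnable> steps = new ArrayList<>();

        // Records a step in the graph; nothing is executed here.
        MiniPipeline apply(Runnable step) {
            steps.add(step);
            return this;
        }

        // Executes the recorded steps in order and returns how many ran.
        int run() {
            for (Runnable step : steps) {
                step.run();
            }
            return steps.size();
        }
    }

    public static void main(String[] args) {
        List<String> output = new ArrayList<>();

        MiniPipeline p = new MiniPipeline();
        p.apply(() -> output.add("read"))
         .apply(() -> output.add("count"))
         .apply(() -> output.add("write"));

        // The graph is built, but no step has run yet.
        System.out.println("before run: " + output); // prints "before run: []"

        // Only this call triggers the actual work.
        p.run();
        System.out.println("after run: " + output);  // prints "after run: [read, count, write]"
    }
}
```

This mirrors the symptoms in the question: the transforms were applied and logged during graph construction, but no job appeared in the console and no output was written, because execution was never requested.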