Based on the WordCount example, I'm trying to read my own JSON data (instead of the Shakespeare txts).
I'm running the pipeline with:
mvn compile exec:java -Dexec.mainClass=myPkg.myClass -Dexec.args=" \
--project=myProj \
--stagingLocation=gs://myBkt/stage \
--runner=BlockingDataflowPipelineRunner \
--output=gs://myBkt/output/out \
--defaultWorkerLogLevel=DEBUG"
The console output looks like this:
<date> com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner fromOptions
INFO: PipelineOptions.filesToStage was not specified. Defaulting to files from the classpath: will stage 68 files. Enable logging at DEBUG level to see which files will be staged.
<date> myPkg$GroupPublished apply
<date> myPkg$GroupPublished apply
INFO: GroupPublished/JsonToDatePosPlatKeyFn.out [PCollection]
<date> myPkg main
main method
public static void main(String[] args) {
...
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.named("ReadJson").from(options.getInputFile()))
.apply(new GroupPublished())
.apply(ParDo.of(new FormatAsStringFn()))
.apply(TextIO.Write.named("WriteCounts").to(options.getOutput()));
}
GroupPublished transformation
static class GroupPublished extends PTransform<PCollection<String>,
PCollection<KV<DatePosPlatKey, Long>>> {
@Override
public PCollection<KV<DatePosPlatKey, Long>> apply(PCollection<String> lines) {
PCollection<DatePosPlatKey> keyList
= lines.apply(ParDo.of(new JsonToDatePosPlatKeyFn()));
PCollection<KV<DatePosPlatKey, Long>> keysCounted =
keyList.apply(Count.<DatePosPlatKey>perElement());
return keysCounted;
}
}
JSON line processing
static class JsonToDatePosPlatKeyFn extends DoFn<String, DatePosPlatKey>{
@Override
public void processElement(ProcessContext c) throws Exception {
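// 'mapper' is assumed to be a Jackson ObjectMapper defined elsewhere in the class,
// e.g. private static final ObjectMapper mapper = new ObjectMapper();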
JsonNode root = mapper.readTree(c.element());
for (JsonNode jsonFact : root) {
DatePosPlatKey key = new DatePosPlatKey(...construct...);
...manipulate...
c.output(key);
}
}
}
Data class
@DefaultCoder(AvroCoder.class)
public static class DatePosPlatKey { ... }
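For context, a minimal sketch of what a key class annotated with @DefaultCoder(AvroCoder.class) typically looks like; the field names below are hypothetical stand-ins for whatever the real class (elided above with { ... }) contains:

@DefaultCoder(AvroCoder.class)
public static class DatePosPlatKey {
    // Hypothetical fields, for illustration only.
    private String date;
    private String position;
    private String platform;

    // AvroCoder encodes via reflection and needs a no-argument constructor.
    public DatePosPlatKey() {}

    public DatePosPlatKey(String date, String position, String platform) {
        this.date = date;
        this.position = position;
        this.platform = platform;
    }
}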
What I've checked so far:
- defaultWorkerLogLevel doesn't seem to have any effect on the console output
- the input data looks like {...}\n{...}\n... (one JSON object per line)
How can I better debug this complete lack of output data? Can you see what I'm doing wrong?
Answer (score: 0)
Following up offline, it turned out the code was missing a call to p.run(), so the pipeline was only constructed but never executed.
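For completeness, a minimal sketch of the fixed main: the options interface name MyOptions is an assumption standing in for whatever the project actually uses; everything else mirrors the snippet above.

public static void main(String[] args) {
    // Parse pipeline options as in the WordCount example; MyOptions is a hypothetical
    // options interface exposing getInputFile() and getOutput().
    MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);

    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.Read.named("ReadJson").from(options.getInputFile()))
     .apply(new GroupPublished())
     .apply(ParDo.of(new FormatAsStringFn()))
     .apply(TextIO.Write.named("WriteCounts").to(options.getOutput()));

    // The missing piece: without run(), the pipeline graph is only built, never executed.
    p.run();
}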