Google Dataflow: Request payload size exceeds the limit: 10485760 bytes

Date: 2016-11-16 08:49:27

Tags: google-cloud-dataflow

When trying to run a large transformation over ~800,000 files, I get the above error message when attempting to run the pipeline.

Here is the code:

public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());
    GcsUtil u = getUtil(p.getOptions());

    try {
        List<GcsPath> paths = u.expand(GcsPath.fromUri("gs://tlogdataflow/stage/*.zip"));
        List<String> strPaths = new ArrayList<String>();
        for (GcsPath pa : paths) {
            strPaths.add(pa.toUri().toString());
        }

        p.apply(Create.of(strPaths))
         .apply("Unzip Files", Write.to(new ZipIO.Sink("gs://tlogdataflow/outbox")));
        p.run();
    }
    catch (IOException io) {
        //
    }
}

I thought this is exactly what Google Dataflow is for - processing huge numbers of files / large amounts of data?

Is there a way to split the load so that this works?

Thanks & BR

Phil

2 answers:

Answer 0 (score: 3)

Dataflow is good at processing large amounts of data, but has limitations in terms of how large the description of the pipeline may be. Data passed to Create.of() is currently embedded in the pipeline description, so you can't pass very large amounts of data there - instead, large amounts of data should be read from external storage, and the pipeline should specify only their locations. With ~800,000 paths of a few dozen bytes each, the serialized pipeline easily exceeds the 10485760-byte (10 MB) request limit quoted in the error message.

Think of it as the difference between the amount of data a program can process and the size of the program's code itself.
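As an aside (not part of the original answer): one way to apply that principle here would be to stage the file list itself in GCS and have the pipeline read only its location, e.g. with TextIO from the Dataflow SDK. The file name below is hypothetical:

// Assumes the ~800,000 paths were written beforehand, one per line, to a text file
// in GCS (hypothetical name). Only this single location ends up in the pipeline
// description, not the paths themselves.
PCollection<String> strPaths =
    p.apply(TextIO.Read.from("gs://tlogdataflow/stage/paths.txt"));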

You can work around this by making the expansion happen inside a ParDo:
p.apply(Create.of("gs://tlogdataflow/stage/*.zip"))
 .apply(ParDo.of(new ExpandFn()))
 .apply(...fusion break (see below)...)
 .apply(Write.to(new ZipIO.Sink("gs://tlogdataflow/outbox")))

where ExpandFn looks something like this:

private static class ExpandFn extends DoFn<String, String> {
  @ProcessElement
  public void process(ProcessContext c) throws IOException {
    GcsUtil util = getUtil(c.getPipelineOptions());
    // expand() returns GcsPaths, so convert each match back to a URI string.
    for (GcsPath path : util.expand(GcsPath.fromUri(c.element()))) {
      c.output(path.toUri().toString());
    }
  }
}

By "fusion break" I refer to this (basically, ParDo(add unique key) + group by key + Flatten.iterables() + Values.create()). It's not very convenient, and there are discussions about adding a built-in transform to do this (see this PR and this thread).
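A minimal sketch of such a fusion break, assuming the Dataflow 1.x SDK used elsewhere in this thread (the name BreakFusion and the random integer key are illustrative choices, not from the answer):

private static class BreakFusion<T> extends PTransform<PCollection<T>, PCollection<T>> {
  @Override
  public PCollection<T> apply(PCollection<T> input) {
    return input
        // Attach an arbitrary key so the following GroupByKey forces a materialization
        // point between the upstream expansion and the downstream write.
        .apply(ParDo.of(new DoFn<T, KV<Integer, T>>() {
          @Override
          public void processElement(ProcessContext c) {
            c.output(KV.of(ThreadLocalRandom.current().nextInt(), c.element()));
          }
        }))
        .apply(GroupByKey.<Integer, T>create())
        // Drop the keys, then flatten each grouped Iterable back into individual elements.
        .apply(Values.<Iterable<T>>create())
        .apply(Flatten.<T>iterables());
  }
}

It would slot into the pipeline above in place of the ...fusion break (see below)... step, e.g. .apply(new BreakFusion<String>()).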

Answer 1 (score: 1)

Thank you very much! Using your input I solved it like this:

public class ZipPipeline {
    private static final Logger LOG = LoggerFactory.getLogger(ZipPipeline.class);

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(
            PipelineOptionsFactory.fromArgs(args).withValidation().create());

        try {
            p.apply(Create.of("gs://tlogdataflow/stage/*.zip"))
             .apply(ParDo.of(new ExpandFN()))
             .apply(ParDo.of(new AddKeyFN()))
             .apply(GroupByKey.<String,String>create())
             .apply(ParDo.of(new FlattenFN()))
             .apply("Unzip Files", Write.to(new ZipIO.Sink("gs://tlogdataflow/outbox")));
            p.run();
        }
        catch (Exception e) {
            LOG.error(e.getMessage());
        }
    }

    private static class FlattenFN extends DoFn<KV<String,Iterable<String>>, String> {
        private static final long serialVersionUID = 1L;

        @Override
        public void processElement(ProcessContext c) {
            KV<String,Iterable<String>> kv = c.element();
            for (String s : kv.getValue()) {
                c.output(s);
            }
        }
    }

    private static class ExpandFN extends DoFn<String,String> {
        private static final long serialVersionUID = 1L;

        @Override
        public void processElement(ProcessContext c) throws Exception {
            GcsUtil u = getUtil(c.getPipelineOptions());
            for (GcsPath path : u.expand(GcsPath.fromUri(c.element()))) {
                c.output(path.toUri().toString());
            }
        }
    }

    private static class AddKeyFN extends DoFn<String, KV<String,String>> {
        private static final long serialVersionUID = 1L;

        @Override
        public void processElement(ProcessContext c) {
            String path = c.element();
            // Derive a grouping key from the file name: first 6 characters of the
            // 5th "_"-separated field.
            String monthKey = path.split("_")[4].substring(0, 6);
            c.output(KV.of(monthKey, path));
        }
    }
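The getUtil(...) helper used in both code blocks is not shown in the thread. A minimal sketch of what it presumably does, assuming it simply returns the GcsUtil configured on the pipeline options via GcsOptions:

private static GcsUtil getUtil(PipelineOptions options) {
    // Assumption: reuse the GcsUtil instance the SDK sets up on GcsOptions.
    return options.as(GcsOptions.class).getGcsUtil();
}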