尝试在~800.000文件上运行大型转换时,在尝试运行管道时出现上述错误消息。
以下是代码:
public static void main(String[] args) {
Pipeline p = Pipeline.create(
PipelineOptionsFactory.fromArgs(args).withValidation().create());
GcsUtil u = getUtil(p.getOptions());
try{
List<GcsPath> paths = u.expand(GcsPath.fromUri("gs://tlogdataflow/stage/*.zip"));
List<String> strPaths = new ArrayList<String>();
for(GcsPath pa: paths){
strPaths.add(pa.toUri().toString());
}
p.apply(Create.of(strPaths))
.apply("Unzip Files", Write.to(new ZipIO.Sink("gs://tlogdataflow/outbox")));
p.run();
}
catch(IOException io){
//
}
}
我认为这正是google数据流的用途?处理大量文件/数据?
有没有办法分割负载以使其工作?
谢谢&amp; BR
菲尔
答案 0 :(得分:3)
Dataflow擅长处理大量数据,但在管道的描述范围方面存在局限性。传递给Create.of()
的数据当前嵌入在管道描述中,因此您无法在其中传递大量数据 - 相反,应从外部存储中读取大量数据,并且管道应仅指定其位置
将其视为程序可以处理的数据量与程序代码本身的大小之间的区别。
您可以通过ParDo
:
p.apply(Create.of("gs://tlogdataflow/stage/*.zip"))
.apply(ParDo.of(new ExpandFn()))
.apply(...fusion break (see below)...)
.apply(Write.to(new ZipIO.Sink("gs://tlogdataflow/outbox")))
其中ExpandFn
如下所示:
private static class ExpandFn extends DoFn<String, String> {
@ProcessElement
public void process(ProcessContext c) {
GcsUtil util = getUtil(c.getPipelineOptions());
for (String path : util.expand(GcsPath.fromUri(c.element()))) {
c.output(path);
}
}
}
和融合中断我指的是this(基本上,ParDo(add unique key)
+ group by key
+ Flatten.iterables()
+ Values.create()
)。这不是很方便,并且正在讨论添加内置转换来执行此操作(请参阅this PR和this thread)。
答案 1 :(得分:1)
非常感谢!使用你的输入我解决了这个问题:
public class ZipPipeline {
private static final Logger LOG = LoggerFactory.getLogger(ZipPipeline.class);
public static void main(String[] args) {
Pipeline p = Pipeline.create(
PipelineOptionsFactory.fromArgs(args).withValidation().create());
try{
p.apply(Create.of("gs://tlogdataflow/stage/*.zip"))
.apply(ParDo.of(new ExpandFN()))
.apply(ParDo.of(new AddKeyFN()))
.apply(GroupByKey.<String,String>create())
.apply(ParDo.of(new FlattenFN()))
.apply("Unzip Files", Write.to(new ZipIO.Sink("gs://tlogdataflow/outbox")));
p.run();
}
catch(Exception e){
LOG.error(e.getMessage());
}
}
private static class FlattenFN extends DoFn<KV<String,Iterable<String>>, String>{
private static final long serialVersionUID = 1L;
@Override
public void processElement(ProcessContext c){
KV<String,Iterable<String>> kv = c.element();
for(String s: kv.getValue()){
c.output(s);
}
}
}
private static class ExpandFN extends DoFn<String,String>{
private static final long serialVersionUID = 1L;
@Override
public void processElement(ProcessContext c) throws Exception{
GcsUtil u = getUtil(c.getPipelineOptions());
for(GcsPath path : u.expand(GcsPath.fromUri(c.element()))){
c.output(path.toUri().toString());
}
}
}
private static class AddKeyFN extends DoFn<String, KV<String,String>>{
private static final long serialVersionUID = 1L;
@Override
public void processElement(ProcessContext c){
String path = c.element();
String monthKey = path.split("_")[4].substring(0, 6);
c.output(KV.of(monthKey, path));
}
}